HMCF - Hamiltonian Monte Carlo Sampling for Fields - A Python framework for HMC sampling with NIFTy
Journal of Statistical Software
MMMMMM YYYY, Volume VV, Issue II. doi: 10.18637/jss.v000.i00
Hamiltonian Monte Carlo Sampling for Fields
Christoph Lienhard, Torsten A. Enßlin
Abstract
HMCF ("Hamiltonian Monte Carlo for Fields") is a software add-on for the NIFTy ("Numerical Information Field Theory") framework implementing Hamiltonian Monte Carlo (HMC) sampling in Python. HMCF as well as NIFTy are designed to address inference problems in high-dimensional, spatially correlated setups such as image reconstruction. They are optimized to deal with the typically high number of degrees of freedom.
HMCF adds an HMC sampler to NIFTy that automatically adjusts the many free parameters steering the HMC sampling machinery. A wide variety of features ensure efficient full-posterior sampling for high-dimensional inference problems. These features include integration step size adjustment, evaluation of the mass matrix, convergence diagnostics, higher-order symplectic integration, and simultaneous sampling of parameters and hyperparameters in Bayesian hierarchical models.
Keywords: Python, Bayesian statistics, field inference, Hamiltonian sampling.
1. Introduction
HMCF implements a Hamiltonian Monte Carlo (HMC) sampler (Duane, Kennedy, Pendleton, and Roweth 1987) for the NIFTy ("Numerical Information Field Theory") framework (Selig, Bell, Junklewitz, Oppermann, Reinecke, Greiner, Pachajoa, and Enßlin 2013; Steininger, Dixit, Frank, Greiner, Hutschenreuter, Knollmüller, Leike, Porqueres, Pumpe, Reinecke et al. 2017) and runs with Python 3 on Unix-like systems. The main purpose of HMCF is to create samples which are distributed according to arbitrary, once-differentiable probability distributions. These samples can, for example, be used to approximate expectation values in cases where brute-force integration is infeasible.
NIFTy is a Python library developed for computational work with information field theory (IFT, Enßlin, Frommert, and Kitaura (2009); Enßlin and Weig (2010)). IFT extends classical probability theory onto functional spaces.
NIFTy is interesting for spatially correlated inference problems such as image reconstruction (Selig, Vacca, Oppermann, and Enßlin 2015; Junklewitz, Bell, Selig, and Enßlin 2016; Pumpe, Reinecke, and Enßlin 2018) or work on geospatial datasets (Buscombe 2016). A main advantage is the resolution-independent calculation of statistical estimates. With HMCF, Bayesian models already implemented in NIFTy can easily be reused for an HMC sampling approach. This can help estimating the impact of approximations present in other approaches, or enable tackling entirely new problems.
At the heart of HMCF lies the HMCSampler class, which constructs an HMC sampler based on a predefined NIFTy Energy class describing a probability distribution P(x) as an energy Ψ(x) = −log(P(x)). Samples drawn from this distribution are saved to disk as HDF5 archives (The HDF Group 2018).
To ensure a successful sampling process, HMCF implements a variety of additional features. The sampler calculates a convergence measure to determine when the burn-in phase has finished; several predefined strategies for how exactly to calculate this measure are available. It is critical for an HMC sampler to use a proper mass matrix, which is why HMCF can recalculate it several times during burn-in, achieving better performance by orders of magnitude in comparison to using an identity as mass matrix. Another important sampling parameter, the integration step size of the symplectic integrator, is also adjusted such that a predefined acceptance rate is matched; again, the exact adjustment strategy can be chosen from a predefined set of options. Although for most purposes a second-order leapfrog integrator is sufficient, higher-order integrators are available as well. Furthermore, HMCF uses multi-processing in that individual Markov chains run on separate cores.
HMCF is optimized to ease the work with HMC sampling. All of the above can be done in only a few lines of code if a well-defined NIFTy Energy class is available.
There are many software packages for HMC sampling available in many different languages. But unlike HMCF, most packages are static in that they use the identity as the mass matrix in general or need a mass matrix specified in the beginning. Since, especially in high-dimensional cases, a good mass matrix estimate is crucial for a successful sampling process, we will concentrate on packages which estimate the mass matrix.
A very popular cross-platform package for HMC is Stan (Stan Development Team 2017). Stan provides interfaces for R, Python, shell, MATLAB, Julia, Stata, Mathematica and Scala. Its biggest advantage is the C++ back-end, which makes it by far the fastest sampler if the same parameters such as integration step size and mass matrix are chosen. Another notable advantage over HMCF is an implementation of the No-U-Turn Sampler (NUTS, Hoffman and Gelman (2014)), which can be seen as an extension of the standard HMC approach.
In Stan the mass matrix is set to the identity initially, but is recalculated from the generated samples during the burn-in phase. The mass matrix can, but does not have to, be restricted to a diagonal matrix.
HMCF differs in that the user is able to define their own initial mass matrix; the Stan developers announced such a feature for future versions, though. Using the samples generated by the initial chain itself involves the risk of having highly correlated samples in case the sampler was malfunctioning due to the usage of the initial mass matrix. To avoid this, HMCF uses samples drawn from the curvature of the full posterior at the current position to reevaluate the mass matrix. We found this approach to be much more efficient. Reevaluated mass matrices are always diagonal in HMCF, but since it is targeted at high-dimensional problems where non-sparse matrices cannot be represented explicitly, this is not really a disadvantage. Furthermore, more recent NIFTy-based algorithms use harmonic-space degrees of freedom as field variables (Knollmüller, Steininger, and Enßlin 2017), which fits better to a mass matrix being diagonal in these field parameters.
Another important package for HMC sampling in Python is pyMC3 (Salvatier, Wiecki, and Fonnesbeck 2018). pyMC3 provides a huge variety of different samplers among other functions for statistical applications. When it comes to the HMC sampler in pyMC3, the main difference to HMCF is that the mass matrix is again evaluated based on the samples of the Markov chain itself, which might be problematic as described in the paragraph above. Again, pyMC3 has a NUTS feature.
Apart from that, the main advantages of HMCF are that it is easy to use for algorithms already written in NIFTy, and its optimization for high-dimensional statistical problems.
This introduction is followed by a short installation guide. In section 3 we give an introduction to HMC sampling on a theoretical/mathematical level. Afterwards, we illustrate the work with NIFTy and HMCF using a simple Bayesian hierarchical model as an example in section 4. This document ends with a short summary in section 5 on why there is a need for a distinct HMCF package.
2. Installation
HMCF relies on the following other Python packages:

NumPy: The very basic Python package for numerical analysis on multidimensional arrays.

SciPy (Oliphant 2007): Library implementing advanced algorithms and functions in Python.

h5py: A Python wrapper for the HDF5 file format.

NIFTy (4.1.0 or newer, Steininger et al. (2017); Martin Reinecke (2018)): A package for statistical field inference problems.

matplotlib (Hunter 2007), optional: A package for producing figures. Necessary for the HMCF tools sub-package.

PyQt5, optional: Necessary for the tools sub-package.
NIFTy supports multi-processing in many calculations via mpi4py (Dalcín, Paz, and Storti 2005), but HMCF needs to restrict each individual Markov chain to one core. If mpi4py is already installed, the user should switch multi-processing off by setting the environment variables MKL_NUM_THREADS and OMP_NUM_THREADS to 1 in a terminal:

export MKL_NUM_THREADS=1
export OMP_NUM_THREADS=1
Installing HMCF along with all required packages is possible via

pip install git+https://gitlab.mpcdf.mpg.de/ift/HMCF
3. Hamiltonian Monte Carlo Sampling
The goal of most statistical problems is to find expectation values ⟨f(x)⟩_P(x) of a function f(x) given a distribution P(x), with

  ⟨f(x)⟩_P(x) = ∫_X f(x) P(x) dx,

where X is the space of all possible values for x. However, especially in high-dimensional cases the integral may become intractable. One approach to circumvent this problem is to use a form of Monte Carlo integration. Samples (x_i) which are distributed like P(x) are used to approximate the expectation value:

  ⟨f(x)⟩_P(x) ≈ (1/N) Σ_{i=1}^N f(x_i)    (1)

The law of large numbers ensures that this approximation converges to the true expectation value in non-pathological situations. Then, the problem is reduced to finding a strategy to generate the samples (x_i). While there are straightforward solutions in some cases such as normal distributions, often more advanced algorithms are needed. A large subset of such algorithms are Markov chain Monte Carlo (MCMC) methods, which have in common that a Markov process is constructed which ultimately converges to the wanted target distribution P(x). HMC sampling is an MCMC method especially applicable to once-differentiable probability distributions on high-dimensional spaces.

The Hamiltonian Monte Carlo (HMC) approach (first introduced by Duane et al. (1987); good summaries: Neal et al. (2011); Betancourt (2017)) uses a variation of the Metropolis-Hastings algorithm (Hastings 1970; Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller 1953) with less random-walk behavior. The idea is to describe the Markov process as a physical Hamiltonian time evolution, exploiting properties of these dynamics which are advantageous for MCMC methods. The samples (x_i) can then be thought of as snapshots of the trajectory of a particle moving through an energy landscape Ψ(x). This energy is defined by the target distribution P(x) via

  Ψ(x) := −log(P(x))

This convention originates from statistical physics, where the probability of a system being in state i with energy E(i) is

  P(i) ∝ exp(−βE(i))

where β is a temperature-dependent scaling parameter.

Additionally, a new normally distributed random variable p ∈ X with covariance M, called momentum, is introduced. The negative logarithm of the joint distribution P(x, p) then looks like a typical physical Hamiltonian:

  H(x, p) = ½ pᵀ M⁻¹ p + Ψ(x) + const

The central idea of HMC sampling is to evolve this system in time according to Hamilton's equations of motion

  ẋ_k = ∂H/∂p_k = [M⁻¹p]_k
  ṗ_k = −∂H/∂x_k = −∂Ψ/∂x_k    (2)

for k = 1, …, dim(X). After some time T the position we arrived at is considered as a new Metropolis-Hastings proposal (x(T), p(T)) =: (x*, p*). This approach is possible since Hamiltonian dynamics have some convenient properties such as volume preservation. Also, in theory, the new sample is exactly as probable as the starting point since the process is energy conserving. In practice, however, the discretization of the integration in time leads to errors in the energy conservation, which is why an accept-reject step is necessary, where the proposal is accepted with probability

  ρ_A = min(1, exp(−ΔE)),    (3)

where ΔE = H(x*, p*) − H(x, p).

The whole algorithm then looks like this:

  Set initial x_0
  for i = 1 to N do
      Generate momentum sample p ∼ N(0, M)
      Evolve system for time T
      New position: (x*, p*) = (x(T), p(T))  (MH proposal)
      Generate sample r ∼ U([0, 1])
      if r ≤ ρ_A then
          Set x_i = x*
      else
          Set x_i = x_{i−1}
      end
  end

The resampling of p in each iteration ensures that the whole parameter space X can be reached. At this point we omit the full proof that the samples (x_i) are then distributed like P(x); see Neal et al. (2011) for details.
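To make equation (1) concrete, here is a small illustrative NumPy sketch (not part of HMCF) estimating ⟨x²⟩ for a standard normal distribution, a case where direct sampling is possible; HMC is needed precisely when it is not:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw N samples x_i from P(x) = N(0, 1) and approximate the
# expectation value <f(x)>_P via equation (1), with f(x) = x^2.
N = 100_000
x = rng.standard_normal(N)
estimate = (x**2).mean()   # should be close to the true value <x^2> = 1
```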
For HMC to work, the integrator for the time evolution of the system needs to be symplectic. Most of the time the leapfrog integrator is used, which is a second-order symplectic integrator. Symplectic integrators of higher orders, based on work presented in Yoshida (1990), are possible as well in HMCF. They have proven to be advantageous for HMC sampling in high-dimensional non-linear cases (Blanes, Casas, and Sanz-Serna 2014).

One step in time of length ε with the leapfrog integrator is calculated via

  p(t + ε/2) = p(t) − (ε/2) ∂Ψ/∂x |_{x(t)}
  x(t + ε) = x(t) + ε M⁻¹ p(t + ε/2)
  p(t + ε) = p(t + ε/2) − (ε/2) ∂Ψ/∂x |_{x(t+ε)}    (4)

This single leapfrog step is applied L times to generate a new sample, such that T = εL. The integration step size ε determines the overall acceptance rate of the sampling process. The advantages of HMC are only present if the acceptance rate is in the range of 0.6 to 0.9 (Beskos, Pillai, Roberts, Sanz-Serna, and Stuart 2013). To relate ε to the acceptance rate, we developed an approximation further discussed in appendix A.

Finally, for an HMC sampling process to work properly it is crucial to find a good mass matrix M, which serves as covariance of the momentum p. There are several approaches, but one very popular strategy is to use samples from the chain itself and base the mass matrix on the variance of these samples:

  M⁻¹ = (1/N) Σ_{i=1}^N (x_i − μ)(x_i − μ)ᵀ    (5)

with μ being the mean value. However, in specific cases other approaches might be better, for example as documented in Taylor et al. (2008). We found that using samples from the chain itself is unfeasible in high-dimensional cases. In HMCF the samples are generated by drawing samples from the curvature of Ψ(x) at the current position of the chain. In other words: for the purpose of estimating the mass matrix, we approximate the target distribution P(x) with a normal distribution and then draw samples from this distribution.
4. Using HMCF
In this section we try to provide some intuition for working with HMCF. It is assumed that the reader is familiar with Python.

We will use a simple Bayesian hierarchical model as an example to introduce the most important functionalities of HMCF. For a comprehensive documentation of all features see appendix B. An implementation of this example is part of the HMCF repository and can be found in demos/multi_field_demo.py. We first introduce the model on a theoretical level, then give a short overview on how the model is implemented within NIFTy in section 4.2. In section 4.3 we show how sampling with HMCF works, and finally introduce features for retrieving and displaying the results in section 4.4.
Consider a telescope observing the large-scale structure of the universe: billions of galaxies producing photons that eventually reach the earth. This photon flux reaching the sensors of the telescope is denoted x. The spatial properties of x can be described as a mathematical field living on a continuous space.

Our telescope measures this photon flux x, which means that it converts it into an electronic signal based on how many photons reach each of its CCD sensors. Errors in lenses and small differences between individual sensors alter the true nature of the original x field, but can be coped with by calibration. In our model, this transformation is represented by a linear response operator R acting on x. Note that the domain and the target domain of R are different: whereas, in theory, the domain of x is continuous, the output of the telescope is definitely discretized. In addition to the response of the instrument, we assume a certain Gaussian noise n due to imperfect electronics, with known covariance N. The measured data d produced by the telescope is then described by the measurement equation

  d = R(x) + n.    (6)

What we are interested in is the "true" signal, based on the data d we got from the telescope. This information is reflected in the conditional probability P(x | d). In terms of Bayesian statistics this is often referred to as the posterior, and Bayes' formula relates the posterior to our assumed model and prior knowledge we may have on our signal x:

  P(x | d) ∝ P(d | x) P(x)

While the likelihood P(d | x) is defined by our model in equation (6) and the noise statistics, the prior P(x) needs more consideration. Observations dating back to Hubble (1934) suggest that, at least to some extent, a log-normal prior is sufficient for the large-scale structure. We therefore introduce another field s = log(x) with a multivariate Gaussian distribution and covariance S, such that x is log-normally distributed.
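The measurement equation (6) is easy to mimic with plain NumPy; the following sketch is purely illustrative (the masked square stands in for a broken sensor region, and none of these names come from the package):

```python
import numpy as np

rng = np.random.default_rng(41)

n_pix = 100
# Stand-in for the photon flux x: a log-normal field (here without the
# proper spatial correlation structure, just for illustration).
x = np.exp(0.5 * rng.standard_normal((n_pix, n_pix)))

def response(x):
    # A linear response R: identity, except for a "broken" square of
    # sensors that returns zeros.
    out = x.copy()
    out[70:90, 70:90] = 0.0
    return out

noise_std = 0.1                       # sqrt of the (diagonal) covariance N
n = noise_std * rng.standard_normal(x.shape)
d = response(x) + n                   # measurement equation (6)
```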
We further want to ensure that our field s is somewhat smooth. This can be enforced by imposing a power law P(k) on the covariance S in its harmonic representation,

  S_{kk'} = δ_{kk'} P(k).

This power law can be chosen such that high curvatures in position space get punished and are therefore improbable. For illustration, we assume a power law

  P(k) = ( l_c / (1 + l_c k) )⁴    (7)

with the correlation length l_c essentially defining how strong curvature is allowed to be.

If we do not know the correlation length, we can treat it as a free hyperparameter, making the problem a full Bayesian hierarchical model. Since l_c is strictly positive, we assume another log-normal prior

  P(l_c) ∝ (1/l_c) exp( −(log l_c − μ_c)² / (2σ_c²) )

where μ_c and σ_c are appropriate parameters for this hyperprior. Our full posterior then turns into

  P(s, l_c | d) ∝ P(d | s) P(s | l_c) P(l_c)    (8)

with the likelihood

  P(d | s) = ∫ δ(d − Reˢ − n) P(n) dn ∝ exp( −½ (d − Reˢ)ᵀ N⁻¹ (d − Reˢ) )

and the prior for s

  P(s | l_c) = |2πS|^(−1/2) exp( −½ sᵀ S⁻¹ s )

The dependence on l_c is encoded in the covariance S. The determinant of S is

  |S| = det S = Π_k P(k)

However, for HMC sampling we need a potential Ψ(s, l_c), i.e., the negative logarithm of the posterior in (8). For better readability, we divide the potential into a prior and a likelihood part as well, such that

  Ψ(s, l_c) = Ψ_l(s, l_c) + Ψ_p(l_c)    (9)

with

  Ψ_l(s, l_c) = −log P(d | s) − log P(s | l_c)
  Ψ_p(l_c) = −log P(l_c)

We omit terms constant in l_c and s, since they are not important for HMC sampling, and arrive at

  Ψ_l(s, l_c) = ½ (d − Reˢ)ᵀ N⁻¹ (d − Reˢ) + ½ ( sᵀ S⁻¹ s + Σ_k log(P(k)) )
  Ψ_p(l_c) = (log l_c − μ_c)² / (2σ_c²) + log(l_c)    (10)

Additionally, the gradient of the potential Ψ(s, l_c) is needed for the time evolution part during HMC sampling. For the likelihood part, the gradient boils down to the following expressions:

  ∂_s Ψ_l = S⁻¹ s − (Reˢ) ⊙ N⁻¹ (d − Reˢ)
  ∂_{l_c} Ψ_l = ½ sᵀ (∂_{l_c} S⁻¹) s + 2 Σ_k ( 1/l_c − k/(1 + l_c k) )    (11)

where ⊙ denotes point-wise multiplication of vectors. For deriving the exact expression for ∂_{l_c} S⁻¹, observe that [S⁻¹]_{kk'} = δ_{kk'} (P(k))⁻¹ and therefore

  [∂_{l_c} S⁻¹]_{kk'} = −4 (1 + l_c k)³ / l_c⁵ δ_{kk'}

For the prior part we get

  ∂_s Ψ_p = 0
  ∂_{l_c} Ψ_p = (1/l_c) ( (log l_c − μ_c)/σ_c² + 1 )    (12)

NIFTy
In this section we will take a look at how such a model can be implemented in NIFTy in general. For more details on NIFTy see Martin Reinecke (2018).

In NIFTy there are a variety of classes representing certain aspects of a typical inference problem, among which the most important ones are:

Domain: A base class representing the underlying space. For example, a regular grid can be defined as an RGSpace, which inherits from Domain.

Field / MultiField: A class representing fields such as x, n or d. It carries information about the underlying Domain as well as the field's value on this domain. The Field supports every arithmetic operation. Other operations such as trigonometric or exponential functions can be applied point-wise using e.g., ift.exp(x). The notation for these functions is the same as the one used by NumPy. MultiFields are essentially dictionaries of Fields which also support all operations above. This can be used to represent all free parameters in a hierarchical model in just one object.
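Conceptually, a MultiField behaves like a dictionary of fields on which arithmetic acts entry-wise; the following plain-Python toy (not the actual NIFTy class) illustrates the idea:

```python
import numpy as np

class ToyMultiField(dict):
    """Dictionary of named NumPy arrays with entry-wise arithmetic,
    loosely mimicking how a NIFTy MultiField bundles all free
    parameters (e.g. 'signal' and 'l_c') into a single object.
    Purely illustrative, not the NIFTy implementation."""

    def __mul__(self, c):
        return ToyMultiField({k: v * c for k, v in self.items()})

    def __add__(self, other):
        return ToyMultiField({k: self[k] + other[k] for k in self})

sl = ToyMultiField(signal=np.zeros((4, 4)) + 2.0, l_c=np.array([0.7]))
scaled = sl * 0.5   # scales every entry at once
```

Scaling such an object entry-wise is exactly the kind of operation used later when building initial chain positions from a known MultiField.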
LinearOperator: An abstract base class for explicitly or implicitly defined linear operators such as the response R. The LinearOperator class carries information about the operator's Domain as well as its target domain. The operator can be applied to a Field x in various ways by calling one of the methods times(x), inverse_times(x), adjoint_times(x), or adjoint_inverse_times(x), although not every one of these methods is available for every linear operator.

Energy: An abstract base class representing the negative logarithm of a distribution P(x). An Energy instance is only defined at a certain value of a Field or MultiField, but the same energy for a different Field y on the same Domain can be obtained by calling the at(y) method. Additionally, the Energy class defines a gradient at the position as well as the curvature. This class is the most important one for HMCF since it defines the potential Ψ(s, l_c) and thereby the whole model.

The model introduced in the previous section can be implemented as a NIFTy Energy, but since this paper is about HMCF we will not go into detail with this. The curious reader can, however, look into the demonstration script and will find the implementations of the likelihood and the prior energy from (10), along with their respective gradients in (11) and (12), in /demos/energies.py in the package's repository.
Implementation of the Hierarchical Model in NIFTy
This and the following section summarize the demos/multi_field_demo.py script. The reader may want to have a look at the script to have an overview. We first import NIFTy and HMCF among other packages via

import nifty4 as ift
import hmcf
A mock dataset for our algorithm to operate on is generated using the get_hierarchical_data function, which was written for this purpose only.

d, sl, R, N = get_hierarchical_data()

It returns NIFTy objects for the data field d, the free parameters s and l_c (as the MultiField sl), the response R and the noise covariance N. d is a Field containing the data; sl is a MultiField with entries 'signal' for s and 'l_c' for l_c. The mock dataset is generated by first sampling a signal field s with the power spectrum defined in (7) and a pre-defined value for l_c. Together with noise drawn from a normal distribution with covariance N, the measurement equation (6) is applied and the mock data set is available. The signal space as well as the data space consist of 100 × 100 pixels, which means that, together with l_c, our problem has 10,001 free parameters. In figure 1 this mock data set, generated with random seed 41, can be observed. For this simple example the instrument R transforms the photon flux perfectly to an electrical signal, except for a square in the bottom right region of the image where the instrument is broken and just returns zeros.

Figure 1: Output of the get_hierarchical_data function. The first image on the left displays the original photon flux x before it hits the detector. For this simple example the instrument transforms the photon flux perfectly to an electrical signal, except for a square in the bottom right region of the image where it is broken. The instrument returns just zeros in that region. The resulting image is displayed in the middle figure. Finally, in the right figure, the data field d is displayed, where Gaussian random noise was added.

For our sampling process we wrote a NIFTy Energy class called SimpleHierarchicalEnergy which implements the potential Ψ(s, l_c) described in (9). The constructor needs the following arguments:

position : NIFTy MultiField
The position (s, l_c) where the energy Ψ(s, l_c) is evaluated. The MultiField has two entries: 'signal' and 'l_c'. The NIFTy Field in 'signal' is the harmonic representation of the s parameter.
Field
The data vector. R : NIFTy
LinearOperator
The instrument’s response. N : NIFTy
LinearOperator
The covariance of the noise. l_c_mu : float Hyperparameter for the log-normal distribution of l c . l_c_sig2 : float Hyperparameter for the log-normal distribution of l c . HT : NIFTy
LinearOperator
Harmonic transformation operator, capable of transforming the position[’signal’] field from the harmonic representation to the position space. inverter : NIFTy
Minimizer
Numerical method for inverting
NIFTy
LinearOperator s. This is necessary to be ableto sample from the curvature of the energy which is required if the mass matrix issupposed to be reevaluated.Using a
MultiField as position enables us to sample the signal and the hyperparameter l c at the same time. d , R and N are already given by the get_hierarchical_data function.Values for l_c_mu and l_c_sig2 are chosen such that values for l c in the range of 0 .
05 to 10 . are probable (the true value used during data generation is 0 . s_space = sl['signal'].domainHT = ift.HarmonicTransformOperator(s_space) and as the inverter needed for the mass reevaluation we use a conjugate gradient implemen-tation in NIFTy : ICI = ift.GradientNormController(iteration_limit=2000,tol_abs_gradnorm=1e-3)inverter = ift.ConjugateGradient(controller=ICI)
Finally, for the initial definition of the energy we use sl as position, since it has the correct MultiField structure. An instance of the energy for our model is then created by calling the constructor

energy = SimpleHierarchicalEnergy(sl, d, R, N, HT, -0.3, 2.,
                                  inverter=inverter)
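As a quick consistency check of the hyperprior terms (10) and (12), one can compare the analytic derivative of Ψ_p(l_c) against a finite difference, here with the μ_c = −0.3 and σ_c² = 2 values passed to the constructor above (a hedged sketch, independent of HMCF):

```python
import numpy as np

mu_c, sig2_c = -0.3, 2.0   # hyperprior parameters as used in the demo

def psi_p(l_c):
    # Psi_p(l_c) = (log l_c - mu_c)^2 / (2 sigma_c^2) + log l_c,  eq. (10)
    return (np.log(l_c) - mu_c) ** 2 / (2 * sig2_c) + np.log(l_c)

def grad_psi_p(l_c):
    # d Psi_p / d l_c = ((log l_c - mu_c)/sigma_c^2 + 1) / l_c,  eq. (12)
    return ((np.log(l_c) - mu_c) / sig2_c + 1.0) / l_c

# Central finite difference at an arbitrary point.
l_c = 0.7
h = 1e-6
fd = (psi_p(l_c + h) - psi_p(l_c - h)) / (2 * h)
```

The two values agree to high precision, which is exactly the kind of gradient check one should run for any hand-written Energy implementation.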
HMCF
Up to this point we only introduced and used NIFTy objects. But now that an Energy is properly defined, we can start using HMCF. First, we create an instance of the HMCSampler class. This object represents the HMC sampler, and its constructor has only one required argument: the potential Ψ. However, for this example we also set the optional num_processes argument, which states the number of chains or CPU cores we use during sampling, and the sample_transform argument, which is necessary since we sample s but are mainly interested in the photon flux x = exp(s).

def sample_trafo(s):
    val = dict(s)
    val['signal'] = ift.exp(HT(val['signal']))
    return ift.MultiField(val=val)

sampler = hmcf.HMCSampler(energy, num_processes=6,
                          sample_transform=sample_trafo)

The sample_transform argument requires a function and represents essentially f in (1). It is important that the sample_transform function takes NIFTy Fields or MultiFields of the same kind as the position argument of the Energy instance which is passed to the constructor of the HMCSampler class (in particular, they need to live on the same domain). The output of the function can be any kind of Field or MultiField regardless of what was put in.

Before we start sampling we need to define initial positions for the Markov chains. In principle this could be completely random but, unfortunately, in high-dimensional cases we need to be somewhere close to non-vanishing parts of the target distribution, because otherwise the gradients are so large that numerical integration during time evolution will fail in any case. In this example we know the true signal sl and will use small deviations from it, but under real circumstances one would need to first use an approximating algorithm to get close to the mean or maximum-a-posteriori solution of the problem and start from there. This can be done using algorithms already implemented in NIFTy. Afterwards, we call the run method of our sampler, which has again only one required argument: the number of samples per Markov chain drawn after the burn-in phase has finished. Additionally, we set the optional argument x_initial with a list of initial positions (the length of this list does not have to be the same as the number of Markov chains).

x_initial = [sl*c for c in [.5, .7, 1.5, 2.]]
sampler.run(500, x_initial=x_initial)
This will initiate a sampling process where the sampler starts in a burn-in phase, during which the integration step size ε from (4) is adjusted to meet the default target acceptance rate of 0.8. Additionally, the mass matrix is reevaluated once in the beginning, and then the sampler waits until the individual Markov chains have converged with respect to a measure based on diagnostics first introduced by Gelman and Rubin (1992). All of this can be adjusted to the user's needs and specific problem; a detailed description of all options can be found in appendix B. The whole sampling process may take up to 10 minutes depending on the machine the script is executed on. If the sampling process seems to freeze in the beginning, this is probably due to the mass reevaluation, which can take some time. A much shorter execution time is possible by setting the n_pixels argument of the get_hierarchical_data function to a lower value.

After some time the sampler is finished, and one can access the mean value (of the transformed samples) as a NIFTy MultiField via the mean attribute of the sampler. To get the photon flux values as a NumPy array one can write

mean_val = sampler.mean['signal'].to_global_data()
The 'signal' statement selects the NIFTy Field representing the signal s, and the to_global_data() method returns the Field's value on the regular grid as a NumPy array. To get the inferred mean value of the correlation length l_c as a float, the following statement does the trick:

l_c_val = sampler.mean['l_c'].to_global_data()[0]
print(l_c_val)
HDF5 file: mean_val = hmcf.load_mean(path/to/file)['signal']l_c_val = hmcf.load_mean(path/to/file)['l_c'][0]
Obviously, this is possible even if the
HMCSampler instance was deleted already (for exampleafter a reboot of your system).The variance of the samples can be obtained in the same way using the var attribute of the
HMCSampler class or calling the load_var function. The results of the sampling process aredisplayed in figure 2.The most prominent difference between the original flux x and the HMC mean value is wherethe instrument was broken in the bottom right region. In particular, the standard deviationof the samples drawn from the full posterior distribution is remarkably high there as onewould expect since information is missing.The samples itself can be loaded by either using the samples attribute of the HMCSampler class or calling the load function of
HMCF . As an example the shape of the returned
NumPy array can be displayed with print(hmcf.load(path/to/file.h5)['signal'].shape) (6, 500, 100, 100) where the first element reflects the number of independent Markov chains, the second elementis the number of sample each chain generated and the third and fourth element reflect theregular grid on which the whole process was carried out.To have a better overview of these samples we can use the show_trajectories functionof
HMCF which displays the chain trajectories through parameter space by pixel. It is aninteractive GUI consisting of two graphs as depicted in figure 3. The left graph shows theinferred mean value and the right graph trajectories of each chain for one selected pixel. Thepixel can either be set in the top row by defining the coordinates and then clicking on “show”or by just clicking on some pixel in the left graph. The function itself is located in the
HMCF tools sub-package and can be called by executing4
HMC Sampling for Fields x hmc meandifference hmc std0.50.00.51.01.52.02.53.0 0.50.00.51.01.52.02.53.00.10.20.30.40.50.60.70.8 0.080.100.120.140.160.180.200.22
Figure 2: Results of the multi_field_demo.py script. The upper row shows the true photon flux x along with the reconstructed picture based on the samples generated by the HMC algorithm. The reconstructed picture gets very blurry where the instrument was broken in the bottom right region and information got lost. The second row displays the absolute difference between the original flux and the reconstructed picture and the standard deviation of the HMC samples. Again, the region in which the instrument is broken is very prominent in both cases.

from hmcf.tools import show_trajectories
show_trajectories(path/to/file, field_name='signal')

where the field_name argument defines which element of the MultiField is displayed.

Figure 3: The show_trajectories GUI. In the left graph the inferred mean value of the field 'signal' in our example is displayed. In the right graph the trajectories of the six Markov chains at the pixel coordinates (x = 12, y = 52) (as stated in the top row) are shown.
5. Summary
Efficient HMC sampling with the high number of degrees of freedom of a numerically represented field is a very complicated task. HMCF takes care of most challenges arising while working on such problems. It provides good default values and adjustment strategies for crucial parameters such as the integration step size or the mass matrix. Nonetheless, the user is still able to customize many details of how the sampler deals with a given problem. These features include:

• Simultaneous sampling of all free parameters and hyperparameters
• Setting the order of the symplectic integrator
• Defining the adjustment strategy for ε and related properties
• Defining the convergence measure strategy and related properties
• Defining how often and how well the mass matrix is reevaluated
• Providing a clear, in-depth overview of relevant parameters of all chains, especially during the burn-in phase
Apart from a diverse set of options to choose from, the structure of the module even eases the creation of new, customized options. We explained the usage of HMCF and demonstrated its performance using the demonstrator coming with the HMCF package.
Acknowledgments
We would like to thank all members of the Information Field Theory group at MPA for many useful discussions, in particular (in alphabetical order): Philipp Arras, Jakob Knollmüller, Reimar Leike, Daniel Pumpe and Martin Reinecke.

References
Beskos A, Pillai N, Roberts G, Sanz-Serna JM, Stuart A (2013). “Optimal tuning of the hybrid Monte Carlo algorithm.” Bernoulli, 19(5A), 1501–1534.

Betancourt M (2017). “A conceptual introduction to Hamiltonian Monte Carlo.” arXiv preprint arXiv:1701.02434.

Betancourt M, Byrne S, Girolami M (2014). “Optimizing the integrator step size for Hamiltonian Monte Carlo.” arXiv preprint arXiv:1411.6669.

Blanes S, Casas F, Sanz-Serna JM (2014). “Numerical integrators for the Hybrid Monte Carlo method.” SIAM Journal on Scientific Computing, 36(4), A1556–A1580.

Dalcín L, Paz R, Storti M (2005). “MPI for Python.” Journal of Parallel and Distributed Computing, 65(9), 1108–1115. ISSN 0743-7315. doi:10.1016/j.jpdc.2005.03.010.

Buscombe D (2016). PySESA: Python framework for spatially explicit spectral analysis. Python package version 0.0.19, URL https://github.com/dbuscombe-usgs/pysesa.

Duane S, Kennedy AD, Pendleton BJ, Roweth D (1987). “Hybrid Monte Carlo.” Physics Letters B, 195(2), 216–222.

Enßlin TA, Frommert M, Kitaura FS (2009). “Information field theory for cosmological perturbation reconstruction and nonlinear signal analysis.” Physical Review D, 80(10), 105005.

Enßlin TA, Weig C (2010). “Inference with minimal Gibbs free energy in information field theory.” Physical Review E, 82(5), 051112.

Gelman A, Rubin DB (1992). “Inference from iterative simulation using multiple sequences.” Statistical Science, 7(4), 457–472.

Hanson KM (2001). “Markov Chain Monte Carlo posterior sampling with the Hamiltonian method.” In Medical Imaging 2001: Image Processing, volume 4322, pp. 456–468. International Society for Optics and Photonics.

Hastings WK (1970). “Monte Carlo sampling methods using Markov chains and their applications.” Biometrika, 57(1), 97–109.

Hoffman MD, Gelman A (2014). “The No-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research, 15(1), 1593–1623.

Hubble E (1934). “The distribution of extra-galactic nebulae.” The Astrophysical Journal, 79, 8.

Hunter JD (2007). “Matplotlib: A 2D Graphics Environment.” Computing in Science & Engineering, 9(3), 90–95. ISSN 1521-9615. doi:10.1109/MCSE.2007.55.

Junklewitz H, Bell M, Selig M, Enßlin T (2016). “RESOLVE: A new algorithm for aperture synthesis imaging of extended emission in radio astronomy.” Astronomy & Astrophysics, 586, A76.

Knollmüller J, Steininger T, Enßlin TA (2017). “Inference of signals with unknown correlation structure from non-linear measurements.” arXiv preprint arXiv:1711.02955.

Reinecke M, Steininger T, Selig M (2018). nifty4: a package implementing numerical methods for Information Field Theory. Max-Planck-Society. Python package version 4.1.0, URL http://ift.pages.mpcdf.de/NIFTy.

Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953). “Equation of state calculations by fast computing machines.” The Journal of Chemical Physics, 21(6), 1087–1092.

Neal RM (2011). “MCMC using Hamiltonian dynamics.” In Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC.

Oliphant TE (2007). “Python for Scientific Computing.” Computing in Science & Engineering, 9(3), 10–20. ISSN 1521-9615. doi:10.1109/MCSE.2007.58.

Pumpe D, Reinecke M, Enßlin TA (2018). “Denoising, Deconvolving and Decomposing multi-Dimensional Photon Observations - The D4PO Algorithm.” arXiv preprint arXiv:1802.02013.

Python Software Foundation (2018). logging: Python framework for logging messages. Part of Python since 2.3 including 3.x, URL https://docs.python.org/3.6/library/logging.html.

Salvatier J, Wiecki T, Fonnesbeck C (2018). pyMC3: Python package for all sorts of MCMC algorithms. Python package version 3.4.1, URL https://docs.pymc.io.

Selig M, Bell MR, Junklewitz H, Oppermann N, Reinecke M, Greiner M, Pachajoa C, Enßlin TA (2013). “NIFTY - Numerical Information Field Theory. A versatile PYTHON library for signal inference.” Astronomy & Astrophysics, 554, A26. doi:10.1051/0004-6361/201321236.

Selig M, Vacca V, Oppermann N, Enßlin TA (2015). “The denoised, deconvolved, and decomposed Fermi γ-ray sky - An application of the D3PO algorithm.” Astronomy & Astrophysics, 581, A126.

Stan Development Team (2017). PyStan: Python package for HMC and NUTS. Python package version 2.16.0.0, URL http://mc-stan.org.

Steininger T, Dixit J, Frank P, Greiner M, Hutschenreuter S, Knollmüller J, Leike R, Porqueres N, Pumpe D, Reinecke M, et al. (2017). “NIFTy 3 - Numerical Information Field Theory - A Python framework for multicomponent signal inference on HPC clusters.” arXiv preprint arXiv:1708.01073.

Taylor J, Ashdown M, Hobson M (2008). “Fast optimal CMB power spectrum estimation with Hamiltonian sampling.” Monthly Notices of the Royal Astronomical Society, 389(3), 1284–1292.

The HDF Group. HDF5. URL https://portal.hdfgroup.org.

Yoshida H (1990). “Construction of higher order symplectic integrators.” Physics Letters A, 150(5-7), 262–268.
A. Dynamic Step-Size Adjustment
Hamiltonian Monte Carlo needs much more computation time per sample proposal than other MCMC approaches. This disadvantage is compensated by much higher acceptance rates and lower autocorrelation between samples. Because of the energy-conserving property of Hamiltonian dynamics, the acceptance rate depends only on the performance of the numerical integrator. The numerical integrator has one free parameter: the step size ε. In principle, the bigger it is, the bigger the integration error ΔE and thereby the smaller the acceptance rate. On the other hand, a too small value for ε results in a sampler which does not explore the typical set on bearable timescales. In practice an acceptance rate of 0.7 to 0.9 (Beskos et al. 2013) is aimed for. HMCF therefore adjusts ε during the burn-in phase such that a user-defined acceptance rate is matched. The first step in constructing a good strategy is to derive a relation between the acceptance rate r_A and the integration error ΔE. We developed the following approximation. Given the acceptance probability ρ_A from equation (3), the expected acceptance rate is given by

r_A(\epsilon) = \big\langle \rho_A(\Delta E) \big\rangle_{P(\Delta E | \epsilon)} = \big\langle \min\big(1, e^{-\Delta E}\big) \big\rangle_{P(\Delta E | \epsilon)}    (13)

where P(ΔE | ε) is the probability distribution for ΔE conditioned on ε. To tackle the min function properly, it is assumed that the probability distribution for the sign of ΔE does not depend on the absolute value of ΔE:

P(\Delta E) \approx P(|\Delta E|)\, P(\mathrm{sgn}(\Delta E))

This reflects the plausible situation that errors are symmetrically probable displacements of trajectories in regions of the phase space that are dominated by a potential gradient and not by a minimum. In this case we can further assume that P(sgn(ΔE) = 1) ≈ P(sgn(ΔE) = −1) ≈ 0.5. With this, equation (13) can be written as

r_A(\epsilon) = \frac{1}{2} \Big( 1 + \big\langle e^{-|\Delta E|} \big\rangle_{P(|\Delta E| \,|\, \epsilon)} \Big) \approx \frac{1}{2} \big( 2 - \langle |\Delta E| \rangle \big) + O\big(\langle |\Delta E|^2 \rangle\big),

where the exponential function was expanded to first order. In practice, a certain value for r_A, like 0.8, is chosen and ε is adjusted such that

\langle |\Delta E| \rangle \overset{!}{=} 2\,\big(1 - r_A(\epsilon)\big) =: \Delta E_\text{wanted}    (14)

This is the relation that lies at the core of most epsilon-adjusting strategies available in HMCF. Note that even if a step is accepted for sure because ΔE is negative, adjusting ε is still possible since only the absolute value of ΔE is needed. This is of great use in cases where nearly every step during burn-in produces a negative ΔE (this happens sometimes if the Markov chains start far off the mean value). For more on how ε is adjusted exactly, see section B.3.

B. Detailed Software Description
This section describes all features and functionalities of HMCF.

B.1. The HMCSampler Class

The HMCF package is optimized for a fast HMC implementation for a NIFTy Energy class. At its heart lies the HMCSampler class handling the whole sampling process. It is able to run several Markov chains on different CPUs using the Python multiprocessing module. The samples are saved to an HDF5 file which is generated every time a new run is initialized and can be loaded back as needed via the package's own load function. During the burn-in phase HMCSampler takes care of adjusting the integration step size ε (see equation (4)) such that a user-defined acceptance rate is reached, as well as setting and possibly reevaluating the mass matrix M. After a run has finished, the mean of the samples is calculated. Of course, in practice one may want to fine-tune some of the specific features and parameters implemented in HMCF. This section is dedicated to introducing and explaining those.
Instantiating

An instance of HMCSampler is created with the following arguments, of which only the first is mandatory:

potential : NIFTy Energy
    The HMC potential Ψ(x). Also defines the domain on which the sampling takes place through its position attribute.

sample_transform : func, optional
    In some cases it is preferable to sample a field not in its position space but in another domain, such as its harmonic space representation, or maybe even in a domain from which there is no linear transformation to the position space. To ensure correct calculation of expectation values such as the mean or the variance, the samples are transformed by sample_transform before being saved to disk. The sample_transform function has to have exactly one argument, which is a Field or MultiField similar to the position attribute of the NIFTy Energy given as the potential argument. The sample_transform function has to return a Field or MultiField. There are, however, no further restrictions on the exact structure of this Field or MultiField.

num_processes : int
    Number of cores involved in the sampling process. This is equal to the number of individual Markov chains started when the instance method run is called.
    Default: 1

sampler_dir_path : str
    A path where the HDF5 file containing the samples is going to go.
    Default: a new folder called “samples” in the ‘__main__’ script's directory
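To illustrate what a sample_transform is expected to do, here is a self-contained NumPy sketch (deliberately not using NIFTy; the transform, field shapes and the averaging step are illustrative assumptions, not HMCF code). It maps harmonic-space samples to position space before statistics are computed:

```python
import numpy as np

def sample_transform(harmonic_sample):
    # Hypothetical transform: map a harmonic-space sample to position
    # space via an inverse FFT, keeping the real part.
    return np.fft.ifft2(harmonic_sample).real

# Draw a few fake "samples" in harmonic space and average them in
# position space, mimicking what happens before samples hit the disk.
rng = np.random.default_rng(0)
samples = [rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
           for _ in range(4)]
transformed = np.array([sample_transform(s) for s in samples])
mean_in_position_space = transformed.mean(axis=0)
print(mean_in_position_space.shape)  # (8, 8)
```

In HMCF the same idea applies, except that the argument and return value are NIFTy Field or MultiField objects rather than plain arrays.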
Running the Sampling Process

In principle the sampling process can be started immediately after creating an instance of HMCSampler by calling the run() method with the following arguments, of which again only the first is mandatory:

num_samples : int
    Number of samples to be drawn per chain after burn-in.

max_burn_in : int, optional
    Maximum number of steps for the chain to converge before it is forced into sampling mode. If no value is stated, forced transition is not going to happen. In this case the chain will only start the actual sampling process if it has converged.

convergence_tolerance : float, optional
    If the convergence measure for the sampling process falls below this value, the chain is assumed to have converged and starts with the actual sampling process. If no value is stated, the tolerance property of the HMCSampler property convergence is used. The default value for said property is 1. For more on this see section B.2.

target_acceptance_rate : float, optional
    Value between 0. and 1., stating what ratio of accepted to rejected samples is preferred. The integration step size is adjusted during burn-in to approximately match this ratio. If not stated, the corresponding property of the epsilon property is used (for which the default value is 0.8).

order : int
    The order of the symplectic integrator. The default value corresponds to a simple leapfrog integration.
    Default: 2

mass : NIFTy EndomorphicOperator, optional
    HMC mass matrix used during sampling (or until it is reevaluated). For more on the mass matrix see section B.4. If no mass is given, an identity matrix is used (at least as initial guess).

x_initial : NIFTy Field or list of NIFTy Fields, optional
    Starting point(s) for the HMC sampler. If more than one Markov chain needs to be initialized, they get their respective initial positions by iterating through the list. The list does not have to have the same length as the number of chains. If there are more chains than elements in the list, some starting positions are reused for the additional chains. If only a Field is given, all chains get the same initial position. If no initial field is passed, a random sample is drawn from a Gaussian distribution centered at the position of the Energy instance given to the constructor of HMCSampler, with the Energy's metric as covariance.
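The rule for recycling starting positions over chains described above can be sketched as follows (plain Python; assign_initial_positions is a hypothetical helper, not part of the HMCF API):

```python
from itertools import cycle, islice

def assign_initial_positions(x_initial, num_chains):
    """Cycle through the provided starting points so that every chain
    gets one, reusing entries when there are more chains than points.
    A single (non-list) starting point is shared by all chains."""
    if not isinstance(x_initial, list):
        x_initial = [x_initial]
    return list(islice(cycle(x_initial), num_chains))

print(assign_initial_positions(['a', 'b'], 5))  # ['a', 'b', 'a', 'b', 'a']
```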
Getting the Results of an HMC-Run

After a run has finished, the sampled mean as a NIFTy Field or MultiField is accessible via the instance's property ‘mean’.

In [1]: hmc_sampler = HMCSampler(nifty_energy, num_processes=5)
In [2]: hmc_sampler.run(200)
In [3]: hmc_sampler.mean
Out[3]:
nifty4.Field instance
- domain = DomainTuple, len: 1
  RGSpace(shape=(256, 256), distances=(0.5, 0.5), harmonic=True)
- val = array([[ 0.63, -0.16, ...,  1.04, -0.64],
               ...,
               [ 0.03,  0.02, ...,  1.22,  0.21]])

Accessing the values of the samples is possible via the samples property. It consists of a (2 + n)-dimensional NumPy ndarray where the first dimension represents the different Markov chains, the second dimension represents the individual samples and the other n dimensions represent the values of the sampled NIFTy Fields. If the sampled objects were MultiFields, a dictionary of NumPy ndarrays is returned. Remember, though, that calling this will load all samples into memory, which might crash the process if not enough memory is available.

In [4]: all_samples = hmc_sampler.samples
In [5]: all_samples.shape
Out[5]: (5, 200, 256, 256)
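Since the samples form a plain (2 + n)-dimensional array, pooled statistics can be illustrated with NumPy alone (a sketch with fake data, conceptually similar to what the mean and var properties expose):

```python
import numpy as np

# Fake samples array shaped (chains, samples_per_chain, *field_shape),
# mimicking the layout of HMCSampler's samples property.
rng = np.random.default_rng(1)
all_samples = rng.standard_normal((5, 200, 16, 16))

# Pool chains and samples to estimate per-pixel mean and variance.
mean = all_samples.mean(axis=(0, 1))
var = all_samples.var(axis=(0, 1), ddof=1)
print(mean.shape, var.shape)  # (16, 16) (16, 16)
```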
Attributes and Properties of HMCSampler

The HMCSampler has a number of other properties and attributes which are mostly used for fine-tuning the sampling process. These are:

potential : NIFTy Energy, read-only
    The potential Ψ for the HMC sampler.

sampler_dir : str
    Setting or getting the path to the directory where the sample files are stored. Corresponds to the parameter of the same name passed in the constructor of the HMCSampler class.

save_burn_in_samples : bool
    Whether or not to save the samples generated during the burn-in phase to disk. Be aware that setting this to True (default value) together with a high or non-existent max_burn_in parameter (in the constructor of the HMCSampler class) could fill your hard drive.
    Default: True

burn_in_samples : NumPy ndarray or dict (str -> ndarray), read-only
    The same as the samples property but with the samples generated during the burn-in phase.

var : NIFTy Field or MultiField, read-only
    The variance of the samples (after sample_transform).

convergence : HMCF Convergence
    For choosing how to calculate the convergence measure (see section B.2).
    Default: HansonConvergence if num_processes == 1 else GelmanRubinConvergence

epsilon : HMCF Epsilon
    For choosing how to adjust the integration step size parameter during burn-in (see section B.3).
    Default: EpsilonPowerLawDivergence

mass : HMCF Mass
    For choosing how to handle the HMC mass during sampling. For more on this see section B.4.

display : HMCF Display
    For choosing how to display the progress of the sampling process (see section B.5).
    Default: LineByLineDisplay

n_limits : list of ints
    To avoid periodic trajectories the number of leapfrog integration steps is randomized. n_limits defines the range from which the number of integration steps is drawn uniformly.
    Default: [60, 70]

B.2. Convergence
The Convergence class handles everything related to the convergence of the Markov chain(s). In principle, a chain in HMCF has converged if a convergence measure, calculated for each degree of freedom in the sampled NIFTy Field or MultiField, drops below a given tolerance. Additionally, HMCF implements intermediate steps of convergence via so-called convergence levels. For the time being their main purpose is to define a time when the HMC mass is reevaluated during burn-in (see also section B.4). A chain is said to have converged with respect to its current convergence level if

\max(\text{measure}) < \text{tolerance} \cdot 10^{\,\text{level}}    (15)

In other words: the level is the number of digits the decimal separator of the tolerance is shifted to the left. The idea behind this is to decrease the level by one each time an intermediate convergence is reached, while at that point recalculating the mass matrix. If the level drops to zero, equation (15) simplifies to max(measure) < tolerance, and the next time this requirement is met, the Markov chain has finished the burn-in phase. It remains to explain how the convergence measure is calculated. There are several different approaches implemented in HMCF as child classes of the Convergence class. Choosing one of them is done by setting the convergence property of the HMCSampler class with one of the Convergence's child classes, e.g.:

hmc_sampler.convergence = GelmanRubinConvergence
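The level mechanics of equation (15) amount to a one-line check, sketched here in plain Python (illustrative only, not HMCF's internal code):

```python
def has_converged(measure_max, tolerance, level):
    """Intermediate convergence check in the spirit of equation (15):
    a chain has converged with respect to its current level if the
    largest per-pixel measure is below tolerance * 10**level."""
    return measure_max < tolerance * 10 ** level

print(has_converged(0.8, 1.0, level=0))   # True: final convergence
print(has_converged(12.0, 1.0, level=1))  # False: above relaxed bound
print(has_converged(8.0, 1.0, level=1))   # True: intermediate level met
```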
For now there are four different possibilities:

MeanConvergence
    This rather simple way of calculating the convergence needs at least two Markov chains. It compares the mean of the samples from all chains (total_mean) to the means of the individual chains (chain_means). The measure is defined as abs(chain_mean / total_mean - 1.) such that it fulfills the non-negativity and the identity-of-indiscernibles criteria for metrics. It proves to be rather unstable if, e.g., the total mean is close to zero.

VarConvergence
    Very similar to MeanConvergence, only with the variances of individual chains and all chains. The measure is equal to abs(chain_var / total_var - 1.).

HansonConvergence
    So far the only convergence measure which can be used even if there is only one chain. It follows Hanson (2001) (again: the measure is the absolute value of the ratio minus one).

GelmanRubinConvergence
    An implementation of the Gelman and Rubin convergence criterion, very popular among MCMC practitioners (Gelman and Rubin 1992) (again: the measure is the absolute value of the ratio minus one).
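As an illustration, a MeanConvergence-style measure abs(chain_mean / total_mean - 1.) can be computed with NumPy alone (a sketch with synthetic chains; not HMCF's implementation):

```python
import numpy as np

def mean_convergence_measure(samples):
    """Per-pixel measure in the spirit of MeanConvergence, one value
    per chain and pixel. samples has shape
    (chains, samples_per_chain, *field_shape)."""
    chain_means = samples.mean(axis=1)      # (chains, *field_shape)
    total_mean = samples.mean(axis=(0, 1))  # (*field_shape,)
    return np.abs(chain_means / total_mean - 1.0)

# Four well-mixed fake chains fluctuating around the same mean of 5.
rng = np.random.default_rng(2)
samples = 5.0 + 0.1 * rng.standard_normal((4, 100, 8, 8))
measure = mean_convergence_measure(samples)
print(measure.shape)        # (4, 8, 8)
print(measure.max() < 1.0)  # True for these well-mixed chains
```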
Attributes and Properties of Convergence

Regardless of which Convergence child class has been used, additional features can be set via its class properties, e.g. the locality property, which defines the number of recent samples considered when calculating the convergence (see below for details):

hmc_sampler.convergence = GelmanRubinConvergence
hmc_sampler.convergence.locality = 200

For the common user the following properties are the most important ones:

locality : int
    The number of recent samples to be considered in calculating the convergence measure. On the one hand this is a form of ‘forgetting’ very old parts of the chain's trajectory which do not represent the current state of convergence. On the other hand this is necessary because of memory issues, i.e. if the burn-in phase takes very long, the memory would blow up since every sample ever created would have to be available to calculate the measure.
    Default: 250

tolerance : float
    Equivalent to the convergence_tolerance parameter of the HMCSampler's run method. In fact, setting this property as described above only has an effect if the (optional) convergence_tolerance parameter is not passed to the run method. In practice the latter approach might be slightly more convenient. If the maximum value of a chain's convergence measure is below this value, the chain is said to have converged and transitions from the burn-in phase into the sampling phase. See also: converged (below).
    Default: 1.

The following additional properties of Convergence are mostly only important for HMCSampler itself and not of relevance for the common user:

converged : NumPy ndarray of bool (1 dim)
    Contains the information of whether the individual chains have converged with respect to the following law: converged = measure_max < tolerance * 10**level

measure : NumPy ndarray (1 + n dim)
    Represents the value of the measure (calculated depending on which child class of the Convergence class has been used) for each element of the degrees of freedom in the sampled NIFTy Field. The first dimension represents the individual chains.

measure_max : NumPy ndarray (1 dim)
    The highest value of the Convergence class property ‘measure’ per chain.

level : NumPy ndarray of int (1 dim)
    See class property converged. The idea is that after a Markov chain has converged with respect to its current level, the level is decreased by one. There are Convergence class methods dec_level and inc_level for decreasing and increasing the level by 1, respectively. For more details on these methods see below. Setting this property is also possible with a simple int which sets the whole NumPy ndarray to that value.

quota : NumPy ndarray of float (1 dim)
    The ratio of elements in the sampler's position NIFTy Field which have converged with respect to tolerance and level (i.e. the ‘intermediate’ convergence).
Additional Methods of Convergence

Internally the convergence levels are decreased and increased by calling

dec_level(chain_identifier=None)
    Decreases the convergence level of chain_identifier (int) by one. If chain_identifier is None, the level of all chains is decreased by one. Either way, if the level of a chain is already zero it is left unchanged.

inc_level(chain_identifier=None)
    Increases the convergence level of chain_identifier (int) by one. If chain_identifier is None, the level of all chains is increased by one.

The convergence level is set under the hood depending on specific properties of the Mass class at the beginning of the HMCSampler's run method.

B.3. Epsilon
The ε parameter defines the leapfrog integration step size (equation (4)). In principle, the bigger it is, the bigger the integration error ΔE and thereby the smaller the acceptance rate. To achieve an approximate acceptance rate defined via the target_acceptance_rate parameter of HMCSampler's run method, ε has to be adjusted during burn-in. Every adjusting strategy relies on the considerations presented in appendix A connecting the target_acceptance_rate to a wanted integration error ΔE_wanted. In HMCF the Epsilon class, much like the Convergence class, is just a base class; much more interesting for the common user are its child classes, which define exactly how ε is adjusted. The class also keeps track of how much ε has changed in recent steps and how close the mean value of recent integration errors ⟨ΔE⟩ is to ΔE_wanted. If ε has not changed very much and ⟨ΔE⟩ ≈ ΔE_wanted, Epsilon is said to have converged. If Epsilon has converged, its value is locked.
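The connection between the target acceptance rate and ΔE_wanted from relation (14), together with a simple multiplicative update in the spirit of the EpsilonSimple strategy described below, can be sketched as follows (illustrative plain Python; the function names are hypothetical, not HMCF's API):

```python
import random

def delta_e_wanted(target_acceptance_rate):
    # Relation (14) from appendix A: <|dE|> should equal 2 * (1 - r_A).
    return 2.0 * (1.0 - target_acceptance_rate)

def adjust_epsilon(epsilon, delta_e, target_acceptance_rate,
                   change_range=0.1, rng=random.Random(0)):
    """Shrink epsilon when the integration error is too large, grow it
    when the error is too small, with a random factor to avoid cycles."""
    wanted = delta_e_wanted(target_acceptance_rate)
    if abs(delta_e) > wanted:
        factor = rng.uniform(1.0 - change_range, 1.0)
    else:
        factor = rng.uniform(1.0, 1.0 + change_range)
    return epsilon * factor

eps = adjust_epsilon(0.005, delta_e=1.2, target_acceptance_rate=0.8)
print(0.0 < eps < 0.005)  # True: error above 0.4, so epsilon shrinks
```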
Available Adjusting Strategies

EpsilonConst
    ε stays constant throughout the whole sampling process. The value can be set by setting its val attribute:

    hmc_sampler.epsilon = EpsilonConst
    hmc_sampler.epsilon.val = 0.005

EpsilonSimple
    ε gets reduced or increased if ΔE is bigger or smaller than ΔE_wanted, respectively, independent of the absolute value of ΔE. In practice, EpsilonSimple has an attribute change_range (float between 0 and 1, default: 0.1), which can be set via:

    hmc_sampler.epsilon = EpsilonSimple
    hmc_sampler.epsilon.change_range = 0.2

    This attribute is only available in EpsilonSimple. Given the change_range, the current value of ε is multiplied by a factor drawn from a uniform distribution U([a, b]) where

    [a, b] = [1 - change_range, 1]      if ΔE > ΔE_wanted
    [a, b] = [1, 1 + change_range]      if ΔE < ΔE_wanted

    The randomness is necessary to prevent recurrent behavior if the integration error ΔE is very sensitive to ε.

EpsilonPowerLaw
    ε gets adjusted just like EpsilonSimple, but the change_range is now defined by the relative difference between ΔE and ΔE_wanted (EpsilonPowerLaw has no attribute change_range!). Given this class's attribute power (positive int, default: 5), set via

    hmc_sampler.epsilon = EpsilonPowerLaw
    hmc_sampler.epsilon.power = 4

    the change_range from EpsilonSimple is defined as:

    \text{change\_range} = \left| \frac{\Delta E - \Delta E_\text{wanted}}{\Delta E + \Delta E_\text{wanted}} \right|^{\text{power}}    (16)

EpsilonPowerLawDivergence
    In practice, working with Poissonian or log-normal distributions on high-dimensional spaces, the integration error ΔE proved to be very sensitive to small changes in ε. With this class ε is adjusted just like EpsilonPowerLaw, with the difference that in case of a divergent ΔE (e.g. an overflow occurs during integration) the change_range becomes more sensitive. A divergence_counter keeps track of the number of times a divergent behavior was detected, and the change_range defined in equation (16) gets a prefactor 2^(−divergence_counter).

EpsilonExponential
    In this case a simple connection between ΔE and ε is assumed:

    |\Delta E|(\epsilon) = a \cdot \epsilon^b    (17)

    where a > 0 and b > 0, such that |ΔE| tends to zero if ε does and diverges for large ε. Former ‘measurements’, i.e. sampling steps of ΔE given ε, are used to calculate a and b. This approach asks for a rather large value of the locality property (see below). ε is adjusted by rearranging equation (17) such that:

    \epsilon_\text{new} = \left( \frac{|\Delta E_\text{wanted}|}{a} \right)^{1/b}    (18)

EpsilonOneOverX
    In this case another connection between ΔE and ε is assumed:

    |\Delta E|(\epsilon) = \frac{a}{(\epsilon_0 - \epsilon)^b} + \text{const}    (19)

    where a > 0, b > 0 and |ΔE|(ε = 0) = 0. The idea behind this relation is an updated version of EpsilonExponential where there is a finite ε₀ for which ΔE already diverges. a and b are again fitted with former ΔEs given ε, whereas ε₀ is set to the current value of ε every time a divergent behavior is detected. ε gets adjusted by rearranging equation (19), such that

    \epsilon_\text{new} = \epsilon_0 - \epsilon_0 \left( \frac{\epsilon_0^{\,b}\, \Delta E_\text{wanted}}{a} + 1 \right)^{-1/b}    (20)
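The EpsilonExponential idea of fitting |ΔE|(ε) = a · ε^b to past measurements and inverting the fit can be sketched with NumPy (a sketch under the stated power-law assumption; propose_epsilon is a hypothetical helper, not HMCF's code):

```python
import numpy as np

def propose_epsilon(eps_history, abs_dE_history, dE_wanted):
    """Fit |dE|(eps) = a * eps**b in log-log space from past
    measurements, then invert the fit to hit the wanted error,
    mirroring equations (17) and (18)."""
    b, log_a = np.polyfit(np.log(eps_history), np.log(abs_dE_history), 1)
    a = np.exp(log_a)
    return (dE_wanted / a) ** (1.0 / b)

# Synthetic history generated from |dE| = 2000 * eps**2:
eps_hist = np.array([0.001, 0.002, 0.004, 0.008])
dE_hist = 2000.0 * eps_hist ** 2
eps_new = propose_epsilon(eps_hist, dE_hist, dE_wanted=0.4)
print(round(eps_new, 4))  # 0.0141, the exact inverse sqrt(0.4 / 2000)
```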
Epsilon
Regardless of which
Epsilon class has been used, additional features can be set via its classproperties, e.g. the locality property which defines the scope of fitting points for classes like
EpsilonExponential and for calculating the convergence measure (see below for details): hmc_sampler.epsilon = EpsilonPowerLawDivergencehmc_sampler.epsilon.locality = 20
For the common user the following properties and attributes are the most important ones: val : float The (current) value of (cid:15) . It is possible to use this property to set a good initial guess(although most of the time unnecessary).Default: 0.005 ournal of Statistical Software locality : int The number of recent samples to be considered in calculating the convergence measureof epsilon .Default: 50 target_acceptance_rate : float Equivalent to the target_acceptance_rate parameter of the
HMCSampler ’s run method.In fact, setting this property only has an effect if the (optional) target_acceptance_rate parameter is not passed to the run method.Default: 0.8 convergence_tolerance : float Essentially the same thing as the tolerance property of
Convergence (section B.2) butfor epsilon
A value > epsilon is calculated (see below)Default: 0.5 divergence_threshold : float The value of ∆ E for which the integration is said to have diverged.Default: 1 E epsilon_limits : list of float Minimum and maximum value for
Epsilon . If the adjusting algorithm proposes a value‘out of range’ the value gets coerced.Default: [1.E-40, 10]
Under the hood, HMCSampler uses the following additional properties of Epsilon:

converged : bool, read-only
    Whether or not epsilon has converged, i.e. whether the convergence_measure is smaller than the convergence_tolerance.

measure : list of float, read-only
    The convergence measure for epsilon contains two values. The first is the relative variance of the value ε; the second represents how close the mean value of ΔE is to ΔE_wanted. If both are smaller than convergence_tolerance, epsilon has converged.

B.4. Mass
Finding an appropriate mass matrix is the most challenging task for a good HMC sampler.
HMCF provides the user with the standard evaluation procedure introduced in equation (5)as well as the possibility to define a problem specific mass matrix.(Re-)evaluation is done by drawing samples { x ( i ) } from the curvature at the current position ofthe NIFTy
Energy given as potential parameter in the constructor. To keep the complexityof the problem bearable only the diagonal of the mass matrix in equation (5) is calculatedand used: M jk = δ jk N − N X i =1 x ( i ) j ! − (21)0 HMC Sampling for Fields
This also removes the problem that, for a non-degenerate mass matrix in n dimensions, at least n independent samples are required. For typical applications in NIFTy, n easily reaches very large values.

The idea behind the HMCF Mass class is to make it easy to define a strategy for handling the mass matrix of an HMC process. This includes reevaluating the mass matrix several times during the burn-in phase and defining an initial mass matrix. By default the identity is used as the mass matrix, but an initial mass matrix can be evaluated without reaching any level of convergence. By default there is one such initial mass evaluation and no reevaluations.

get_initial_mass : bool
Whether or not to evaluate an initial mass. Setting this to True/False will increase/decrease reevaluations by 1, respectively.
Default: True

reevaluations : NumPy ndarray of int
The number of reevaluations (including the initial one if set) for each chain. Reevaluation takes place if the chain has converged with respect to its current convergence level introduced in section B.2. Setting the number of reevaluations is also possible (and recommended) with a simple int, which sets all chains to that value.
Default: numpy.ones(num_chains) (i.e., one reevaluation for every chain)

operator : NIFTy EndomorphicOperator
The actual mass operator. If there is a problem-specific mass operator, it can be set with this property before initiating the run. If get_initial_mass is True, setting operator will set get_initial_mass to False and decrease reevaluations by 1.
Default: Identity (as NIFTy DiagonalOperator)

num_samples : int
Defines the number of samples drawn from the curvature when reevaluating the mass matrix.
Default: 50

shared : bool
If True, reevaluating a mass matrix is done at the mean of all individual chain positions. All chains have to meet the conditions for evaluation mentioned above. Afterwards each chain gets the same mass matrix. If False, each chain reevaluates its mass matrix individually if the chain meets the conditions for evaluation.
Default: False
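As an illustration of the diagonal mass evaluation in equation (21), the estimate can be sketched in plain NumPy. This is a minimal sketch assuming zero-mean curvature samples; the function name diagonal_mass is hypothetical and not part of HMCF.

```python
import numpy as np

def diagonal_mass(samples):
    # samples: shape (N, n) -- N curvature samples of an n-dimensional field.
    # Diagonal of equation (21): M_jj = ( N^-1 sum_i (x_j^(i))^2 )^(-1).
    second_moment = np.mean(samples ** 2, axis=0)
    return 1.0 / second_moment

rng = np.random.default_rng(0)
# Zero-mean samples with per-dimension standard deviations 1 and 2:
samples = rng.normal(size=(100_000, 2)) * np.array([1.0, 2.0])
mass = diagonal_mass(samples)
# mass[0] is close to 1, mass[1] close to 1/4
```

With only the diagonal retained, the estimate is well defined even when far fewer than n samples are available, which is the point made in the text above.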
B.5. Display
Naturally, HMC sampling can take quite a while. Disadvantageous parameter settings might lead to a malfunctioning sampling process, and being able to discover a pathological run early can save hours or even days. For this reason, HMCF offers a number of display modes for diagnostic purposes. On the other hand, displaying indicators of tens of parallel running chains might be overwhelming. All available display options have in common that they display more or less information based on a logging level, just like the logging levels in the logging package (Python Software Foundation 2018) of the Python standard library. It is in fact possible to use the appropriate module variables from the logging package. These are:

CRITICAL (numerical value: 50)
ERROR (numerical value: 40)
WARNING (numerical value: 30)
INFO (numerical value: 20)
DEBUG (numerical value: 10)
NOTSET (numerical value: 0)

The logging level can be set using the display's method setLevel, e.g.,

from logging import INFO
hmc_sampler.display.setLevel(INFO)

Similar to the
Epsilon and
Convergence classes, there are several Display classes which define the three displaying modes.
Display : Displays nothing at all. Serves as base class for the two other display classes.
LineByLineDisplay : If the level of the
HMCF logger is set to
INFO or below, certain indicators are printed at every sampling step, as depicted in figure 4a). Otherwise only warnings and errors are printed.
TableDisplay : The most advanced version of displaying indicators. A table is generated and dynamically updated, containing information for each chain, as depicted in figure 4b). This class relies heavily on the curses Python package and therefore alters the terminal behavior during sampling.

The columns in
TableDisplay display the following parameters:

ch : The chain number or 'identifier'.

acc. rate : The acceptance rate for each chain. During burn-in this is only the acceptance rate of the last 10 samples, since it highly depends on the value of Epsilon. After burn-in, acc. rate displays the overall acceptance rate.

dEnergy : The most recent value of the integration error ∆E.

Convergence
  conv. : Whether the chain has converged with respect to the current convergence level.
  lev : The current convergence level.
  meas : The maximum value of the convergence measure of the chain.
  quota : The percentage of points in the sampled NIFTy Field which have converged with respect to the current convergence level.

Epsilon
  value : The current value of Epsilon.
  conv. : Whether or not Epsilon has converged with respect to its measure.
  meas. : The maximum value of the Epsilon convergence measure.
Additional Methods of
Display
Regardless of which
Display class has been used, the quantity of displayed information can be changed via the setLevel method.

Figure 4: a) LineByLineDisplay with level = DEBUG during burn-in. b) TableDisplay with level = DEBUG during burn-in.

setLevel(level) This is equivalent to the setLevel function in the
Python package logging.
Default: logging.INFO
B.6. Additional Functions
HMCF provides a number of additional functions targeted at easing the handling of the HDF5 files generated during sampling. These files have rather cryptic default names of the form 'runYYYY-MM-DD_hh-mm-ss.h5', where Y, M, D, h, m and s represent year, month, day, hour, minute and second digits at the time of the initialization of that run, respectively. There are three functions handling data stored in these files: load, load_mean, load_var.

The load
Function
The load function takes the following arguments, of which only the first is mandatory. Keep in mind that for high-dimensional problems the amount of data loaded into memory can easily be several GiB.

path : str
Either the path to the HDF5 file or a path to the directory where the HDF5 file(s) are stored. If path is a directory, the latest 'run' file (with respect to the file name) is loaded.

attr_type : str
The 'attribute' to load. For files generated with the HMCSampler class, possible values are: 'burn_in' and 'samples'. For 'burn_in' the samples generated during burn-in are loaded (of course, this is only possible if the HMCSampler class attribute save_burn_in_samples was set to True). For 'samples' the samples generated after burn-in are loaded.
Default: 'samples'

start : int
Only loads the samples from step start onward.
Default: 0

stop : int, optional
Only loads the samples up to step stop. Loads until the end if no value is passed.

step : int
Only loads every nth sample, where n is given by step.
Default: 1

In all cases the function returns a NumPy ndarray if Fields were sampled and Python dictionaries if MultiFields were sampled. In the latter case the respective NumPy ndarray for each individual field can be obtained by using the key used in the MultiField. In all cases the array has 2 + n dimensions, where again, the first and second dimension represent chains and steps, respectively. If not all chains have the same number of samples (e.g., for attr_type = 'burn_in', every chain needs a different number of steps to reach convergence), shorter chains are filled with numpy.nans in the output array to match the size of the longest chain.
The load_mean
Function
The load_mean function calculates the mean value based on the samples saved to an HDF5 file.

path : str
Either the path to the HDF5 file or a path to the directory where the HDF5 file(s) are stored. If path is a directory, the latest 'run' file (with respect to the file name) is loaded.

domain : NIFTy Domain or NIFTy MultiDomain, optional
If domain is given, the function output is a NIFTy Field or a NIFTy MultiField with the respective domain and the calculated mean as value.

The function returns the mean value either as a NumPy ndarray, a Python dictionary of NumPy ndarrays, a NIFTy Field, or a NIFTy MultiField, depending on what was sampled originally and whether the domain argument is given.
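As a sketch of what load_mean computes from a loaded sample array: given the (chains, steps, field dimensions) layout with NaN padding described for load above, the mean is taken over the chain and step axes. The array below is hypothetical.

```python
import numpy as np

# Hypothetical sample array in the layout returned by load:
# 2 chains, up to 2 steps, a 2-pixel field; the second chain is
# one step shorter and padded with NaN.
samples = np.array([[[1.0, 2.0], [3.0, 4.0]],
                    [[5.0, 6.0], [np.nan, np.nan]]])

# Mean over chains and steps, ignoring the NaN padding:
mean = np.nanmean(samples, axis=(0, 1))
# mean == [3., 4.]
```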
The load_var
Function
The load_var function calculates the variance based on the samples saved to an HDF5 file.

path : str
Either the path to the HDF5 file or a path to the directory where the HDF5 file(s) are stored. If path is a directory, the latest 'run' file (with respect to the file name) is loaded.

domain : NIFTy Domain, optional
If domain is given, the function output is a NIFTy Field with domain domain and the calculated variance as value.

The function returns the variance either as a NumPy ndarray, a Python dictionary of NumPy ndarrays, a NIFTy Field, or a NIFTy MultiField, depending on what was sampled originally and whether the domain argument is given.
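Analogously to the mean, the variance can be sketched on the same hypothetical (chains, steps, field dimensions) array with NaN padding:

```python
import numpy as np

samples = np.array([[[1.0, 2.0], [3.0, 4.0]],
                    [[5.0, 6.0], [np.nan, np.nan]]])

# Variance over chains and steps, ignoring NaN padding (population
# variance, i.e. ddof=0, the numpy.nanvar default):
var = np.nanvar(samples, axis=(0, 1))
# var == [8/3, 8/3]
```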
B.7. Tools
Tools is a sub-module of
HMCF which provides two handy functions to evaluate an already finished sampling process. One, show_trajectories, provides a GUI to quickly visualize the individual Markov chains; the other, get_autocorrelation, calculates the autocorrelation of the chains.
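The autocorrelation computed by get_autocorrelation can be sketched for a single chain as follows. This is a minimal NumPy illustration, not the HMCF implementation; it assumes the bar denotes mean-subtracted samples, and the function name is hypothetical.

```python
import numpy as np

def autocorrelation(chain, t):
    # chain: samples of one chain, shape (steps, *field_shape).
    # Returns sum_i xbar^(i) * xbar^(i+t), with xbar the
    # mean-subtracted samples, evaluated pixel-wise.
    xbar = chain - chain.mean(axis=0)
    n = xbar.shape[0]
    return np.sum(xbar[: n - t] * xbar[t:], axis=0)

rng = np.random.default_rng(1)
chain = rng.normal(size=(1000, 3))   # 1000 steps, 3-pixel field
ac0 = autocorrelation(chain, 0)      # t = 0: sum of squared deviations
ac1 = autocorrelation(chain, 1)
```

For well-mixing chains of independent samples, ac1 should be small compared to ac0.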
The show_trajectories
Function
A simple GUI for visualizing Markov chains for one- or two-dimensional problems. The GUI is divided into two graphs, as displayed in figure 3. The left one represents the underlying problem space, i.e., its geometry, with either the 'mean' value or a similar user-defined field. The right graph displays the trajectory of the different chains for a selected pixel in the left graph.

path : str
Either the path to an HDF5 file or a path to the directory where the HDF5 file(s) are stored. If path is a directory, the latest 'run' file (with respect to the file name) in said directory is loaded.

reference_field : NIFTy Field or MultiField, NumPy ndarray or dict(str -> ndarray), optional
The field displayed on the left graph of the GUI. If none is given, the mean value of the samples is used as reference.

field_name : str, optional
If MultiFields were used, this argument is mandatory and specifies which of the sub-fields is supposed to be displayed.

solution : NIFTy Field or MultiField, NumPy ndarray or dict(str -> ndarray), optional
In case a 'right' answer is available (e.g., in a mock data example), it can be passed here and is displayed in the trajectories graph as a horizontal line as additional reference.

start : int
Only loads the samples from step start onward.
Default: 0

stop : int, optional
Only loads the samples up to step stop. Loads until the end if no value is passed.

step : int
Only loads every nth sample, where n is given by step.
Default: 1

The get_autocorrelation
Function
Calculates the autocorrelation of the samples (after burn-in) for a given t, where t is the shift in

auto\_corr[t] = \sum_i \bar{x}^{(i)}\, \bar{x}^{(i+t)}   (22)

The function takes the following arguments:

path : str
Either the path to an HDF5 file or a path to the directory where the HDF5 file(s) are stored. If path is a directory, the latest 'run' file (with respect to the file name) in said directory is loaded.

shift : int
The shift t as described above.
Default: 1

field_name : str, optional
If MultiFields were used, this argument is mandatory and specifies which of the sub-fields is supposed to be used.

The function returns a 1 + n dimensional NumPy ndarray, where the first dimension represents the different chains and the other n dimensions the dimensionality of the problem.

Affiliation:
Christoph Lienhard
Max Planck Institute for Astrophysics
Karl-Schwarzschild-Str. 1
D-85741 Garching, Germany
and
Ludwig-Maximilians-Universität München
Geschwister-Scholl-Platz 1
80539 Munich, Germany
and
Technische Universität München
Arcisstr. 21
80333 Munich, Germany
E-mail: [email protected]

Torsten Enßlin
Max Planck Institute for Astrophysics
Karl-Schwarzschild-Str. 1
D-85741 Garching, Germany
and
Ludwig-Maximilians-Universität München
Geschwister-Scholl-Platz 1
80539 Munich, Germany
Journal of Statistical Software published by the Foundation for Open Access Statistics