Dynamic Gauss-Newton-Metropolis Algorithm
The MCMC Jagger
Mehmet Ugurbil
New York University, Courant Institute of Mathematical Sciences
251 Mercer Street, New York, NY 10012

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science, Department of Mathematics, New York University, 01/2016.

Advisor: Jonathan Goodman

Abstract

GNM: The MCMC Jagger. A rocking awesome sampler.

This Python package is an affine invariant Markov chain Monte Carlo (MCMC) sampler based on the dynamic Gauss-Newton-Metropolis (GNM) algorithm. The GNM algorithm is specialized in sampling highly non-linear posterior probability distribution functions of the form e^{-||f(x)||^2/2}, and the package is an implementation of this algorithm.

On top of the back-off strategy in the original GNM algorithm, a dynamic hyper-parameter optimization feature has been added to the algorithm and included in the package, to improve the performance of the back-off and therefore of the sampling. There are also a Jacobian tester, an error bars creator, and many more features for ease of use included in the code.

The problem is introduced and a guide to installation is given in the introduction. Then the use of the Python package is explained. The algorithm is presented, and finally some examples using exponential time series show the performance of the algorithm and the back-off strategy.

Section 1: Introduction

The GNM algorithm samples generalizations of probability densities of the form p(x) ∝ e^{-||f(x)||^2/2}. Here x ∈ R^n is a "parameter" vector, f : R^n → R^m is a "model" function, and ||f(x)||^2 = \sum_{k=1}^m f_k(x)^2. Typically m > n. GNM is a powerful algorithm for ill-conditioned problems because it is affine invariant.

This is related to the nonlinear least squares problem min_x ||f(x)||^2 that arises in statistics. For example, if m(x) predicts the outcome of an experiment with parameter x, and the y_k are measurements, then the goodness of fit is \sum_{k=1}^m (m_k(x) - y_k)^2 / \sigma_k^2. If f_k(x) = (m_k(x) - y_k)/\sigma_k, then this sum has the form ||f(x)||^2.

One generalization of p(x) is adding a Gaussian prior probability distribution on the parameters, π : R^n → R, such that

$$\pi(x) = \frac{1}{Z} e^{-(x - m)^T H (x - m)/2},$$

where m is the mean and H is the precision matrix, the inverse of the covariance. Another generalization is an indicator function χ : R^n → {0, 1} that limits the domain of the model function.

The user of the sampler must write Python code to evaluate χ(x), f(x), and ∇f(x), the Jacobian. It is optional to add prior information: a mean vector m ∈ R^n and a precision matrix H ∈ R^{n×n} (the inverse of the covariance matrix).

gnm is a Python package. The Python package numpy is required before gnm can be used. To run the examples and plot the results, you will need matplotlib. To use the acor feature, you will need the package acor. From the standard Python library, you will need os, setuptools (or distutils), re, sys, copy, and json. These packages likely come with your Python installation.

The easiest way to install gnm is to use pip.

$ pip install gnm

To download manually, use git clone or download as a zip from https://github.com/mugurbil/gnm.

$ git clone https://github.com/mugurbil/gnm.git

Then you can install manually by going into the gnm directory and running setup.py.

$ python setup.py install

To clean the repository after installation, one can run clean with setup.py.

$ python setup.py clean
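Before turning to the package interface, it may help to see the target density written out in code. The following standalone numpy sketch evaluates the unnormalized log-posterior log p(x) described above with a Gaussian prior of mean m and precision H; the model f and all numerical values here are illustrative, not part of gnm.

import numpy as np

# Illustrative scalar model: f(x) = (x - y) / sigma for one datum y.
y, sigma = 1.0, 0.5
m = np.zeros(1)   # prior mean
H = np.eye(1)     # prior precision (inverse covariance)

def f(x):
    return np.array([(x[0] - y) / sigma])

def log_posterior(x):
    # log p(x) up to a constant: -(x - m)^T H (x - m)/2 - ||f(x)||^2/2
    r = x - m
    return -0.5 * r @ H @ r - 0.5 * np.sum(f(x) ** 2)

print(log_posterior(np.array([0.3])))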
Section 2: Code

The user needs to write Python code that contains the model information. This Python function, which we will call the user function, should have two inputs and three outputs. This code will be passed to the sampler.

The first input should be the parameter vector, x ∈ R^n. The second input should be the arguments that the function takes, which may include the data.

The first output is the constraint χ(x). It tells where the model function is defined: it has value 1 (or True) when the function and its derivative are defined, and 0 (or False) otherwise. If D is the domain of f and ∇f, then

$$\chi(x) = \begin{cases} 1 & x \in D \\ 0 & x \notin D \end{cases}$$

The second output is an array of size m of f evaluated at x, f(x) ∈ R^m, which should be convertible to an np.array. The third output is an array of size m by n that is the Jacobian matrix ∇f evaluated at x, J(x) ∈ R^{m×n}. This also needs to be convertible to an np.array.

Listing 1: User defined function outline.

def model(x, args):
    ...
    return chi(x), f(x), J(x)

The sampler needs the user function (model), the arguments the user function takes (args), and the initial guess of the sampler (x_0). This information is provided to the sampler during initialization. The function is evaluated at the initial guess to check that it is defined at that point.

Listing 2: Sampler initialization.

jagger = gnm.sampler(x_0, model, args)

The prior information is optional. For a Gaussian prior distribution, the mean vector m (m) and the precision matrix H (H) need to be provided.

Listing 3: Setting the prior.

jagger.prior(m, H)
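As a concrete instance of the outline in Listing 1, here is a minimal, hypothetical user function for the scalar model f(x) = (x - y)/σ. Since f and its derivative are defined everywhere, the constraint χ is always 1, and args carries the data; the values are illustrative.

import numpy as np

def model(x, args):
    y, sigma = args                      # data and noise level from args
    chi = 1                              # f is defined on all of R
    f = np.array([(x[0] - y) / sigma])   # residual, shape (m,) with m = 1
    J = np.array([[1.0 / sigma]])        # Jacobian, shape (m, n) = (1, 1)
    return chi, f, J

With this function, the initialization of Listing 2 would read jagger = gnm.sampler(x_0, model, (1.0, 0.5)), where the argument tuple is again illustrative.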
The Jtest method checks whether the derivative information and the model function given in the user function match, by employing numerical differentiation. It has two required inputs, six optional inputs, and one output.

The domain D_J for Jtest should be a rectangle given by two vectors, the top right and the bottom left corners of the rectangle, x_max and x_min respectively. These vectors are the two required inputs. Note that they should have the same dimensions, and all values of x_min should be less than the corresponding values of x_max so that the domain is nonempty.

$$D_J = \{x : \forall i,\ (x_{\min})_i < x_i < (x_{\max})_i\}$$

The method has one output, error, that tells whether the numerical Jacobian, Df, converged to the Jacobian supplied by the user function, ∇f. The output is 0 if the convergence occurred. Otherwise, it is the norm of the difference of the numerical and provided Jacobians, ||Df − ∇f||, at the point where it failed to converge.

Listing 4: Jtest usage.

error = jagger.Jtest(x_min, x_max)
assert error == 0

The six optional inputs of Jtest and their default values are given in Table 1. We then explain how the method works and where these variables come into use.
Table 1: Jtest optional inputs.

SYMBOL    IN CODE    DEFAULT VALUE
δx        dx
N         N
ε_max     eps_max
p         p
l_max     l_max
r         r

Jtest chooses i.i.d. uniform random points, x_k, in the specified domain. It then calculates the numerical Jacobian at each point by the symmetric difference quotient. This operation is done for all entries of the Jacobian, for j = 1 to n; for each j, i runs from 1 to m. Note that there is no i in the code since the operations are vectorized.

$$\left(D^{(l)} f(x_k)\right)_{ij} = \frac{f_i(x_k + \delta x_j^{(l)}) - f_i(x_k - \delta x_j^{(l)})}{2\,\delta_j^{(l)}} \approx \frac{\partial f_i}{\partial x_j}(x_k)$$

Here δx_j^{(l)} = δ_j^{(l)} e_j, where e_j has 0s everywhere but a 1 at the j-th entry, and δ_j^{(l)} is the magnitude of the perturbation in the e_j direction at the l-th stage. The initial perturbation magnitude, δ_j^{(0)}, depends on the domain and dx through δ_j^{(0)} = (x_max − x_min)_j · dx.

The numerical Jacobian is compared under the p-norm to the Jacobian given by the user function evaluated at x_k: ε_k^{(l)} = ||D^{(l)} f(x_k) − ∇f(x_k)||_p. If this error, ε_k^{(l)}, is smaller than ε_max, the point passes the differentiation test. Otherwise, the magnitudes of the perturbations are reduced by multiplication with r:

$$\delta_j^{(l)} = \delta_j^{(l-1)} \cdot r = \delta_j^{(0)} \cdot r^l, \quad \forall j.$$

This is repeated for the same point at most l_max times, so that l ≤ l_max. If the point is still generating an error greater than ε_max, the Jacobian is declared incorrect, because the numerical Jacobian does not converge to the provided Jacobian as we decrease δx repeatedly, δx → 0. In this case, the program returns ε_k^{(l_max)}, the final error at this point.

This procedure is repeated for N different points. If all the points pass the differentiation test, the method returns 0.

Listing 5: Advanced Jtest usage.

error = jagger.Jtest(x_min, x_max, N=5, dx=0.001)
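The test just described is easy to mimic outside the package. The sketch below is a standalone numpy illustration of the same idea, not the gnm internals: it compares the symmetric difference quotient against a user-supplied Jacobian at one point, shrinking the perturbation by r up to l_max times. The default values here are placeholders, since the package's own defaults are listed in Table 1.

import numpy as np

def jacobian_error(f, J, x, delta0, eps_max=1e-4, l_max=10, r=0.5):
    # Returns 0 if the numerical Jacobian converges to J(x),
    # otherwise the last error ||Df - J(x)||.
    n = x.size
    delta = np.array(delta0, dtype=float)
    for l in range(l_max + 1):
        cols = []
        for j in range(n):
            e = np.zeros(n)
            e[j] = delta[j]              # perturbation delta_j * e_j
            cols.append((f(x + e) - f(x - e)) / (2.0 * delta[j]))
        Df = np.column_stack(cols)       # symmetric difference quotient
        err = np.linalg.norm(Df - J(x))  # 2-norm here; gnm uses a p-norm
        if err < eps_max:
            return 0.0
        delta *= r                       # reduce the perturbation and retry
    return err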
There are currently two types of back-off strategies, static and dynamic. The static way fixes the maximum number of back-offs as well as the back-off dilation factor. The dynamic way fixes the maximum number of back-offs but changes the dilation factor based on the current position of the sampler.

Listing 6: Setting back-off.

jagger.static(5, 0.2)
jagger.dynamic(3)
2.6 Sampling

Once all the information is given to the sampler, all that remains is to call the sample method with the number of samples wanted (n_samples).

The sampling process can be divided into sessions. If you run the sample method again, the sampler will remember where it left off and continue the sampling from there. Also, you can divide the sampling of n_samples into n_divs divisions by setting the option divs=n_divs.

If used in a terminal with n_divs > 1, the visual option can be set to True to see the progress of the sampling: turning visual to True shows the percentage of the sampling completed.

The sampling can be done in safe mode by setting safe=True. This causes the sampler to save its progress at every division, and once more when the sampling is completed.

Listing 7: Sampling.

jagger.sample(n_samples, divs=n_divs, visual=True, safe=True)

The initial guess might be far away from the region where the chain is supposed to be, causing a delay before the chain converges to the desired distribution. After the sampling, these undesired initial samples can be burned (discarded) by calling burn and providing the number of samples to be burned (n_burned).

Listing 8: Burning samples.

jagger.burn(n_burned)

2.7 Outputs

Outputs of the sampler can be accessed after the sampling as properties. These outputs are listed in Table 2. The main output is the chain, which contains the Markov chain of the samples. This is an np.array of shape (n_samples, n). Thus, it has length equal to the number of samples, and each sample is a vector of size n containing the position of that sample.

Listing 9: Information access.

chain = jagger.chain

Table 2: Outputs.

CODE           EXPLANATION
chain          The Markov chain
n_samples      Number of elements sampled
n_accepted     Number of samples accepted
accept_rate    Acceptance rate of the algorithm
call_count     Number of function calls
step_count     Array of size max_steps indicating how many samples got accepted at each back-off step

There is a method called error_bars for ease of use. It evenly divides the space given by a rectangle into bins and sums up the number of samples that fall into each bin. It also estimates the error bars for each bin. It requires three inputs: an integer specifying the number of bins in each dimension (n_dims), a vector specifying the minimum of each dimension of the rectangle (d_min), and a vector specifying the maximum (d_max). These vectors need to be castable to np.array and have dimension n. The method produces three outputs: the estimate of the posterior (p_x), the estimate of the error bars (err), and the mid-point location of each bin (x). All three of these outputs are of form np.array and of shape (n, n_dims).

Listing 10: Error bars.

x, p_x, err = jagger.error_bars(n_dims, d_min, d_max)
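Putting the pieces of this section together, a complete session might look like the sketch below. The model, data, and all settings are illustrative; only the call sequence follows the listings above.

import numpy as np
import gnm

def model(x, args):
    y, sigma = args
    chi = 1
    f = np.array([(x[0] - y) / sigma])
    J = np.array([[1.0 / sigma]])
    return chi, f, J

x_0 = np.array([0.0])                       # initial guess
jagger = gnm.sampler(x_0, model, (1.0, 0.5))
jagger.prior(np.zeros(1), np.eye(1))        # Gaussian prior: mean, precision
jagger.dynamic(3)                           # dynamic back-off, up to 3 steps
jagger.sample(10000, divs=10, visual=True)  # sample in 10 divisions
jagger.burn(1000)                           # discard the first 1000 samples
chain = jagger.chain                        # np.array of shape (n_samples, n)
x, p_x, err = jagger.error_bars(50, np.array([-3.0]), np.array([3.0]))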
Section 3: Algorithm

The Gauss-Newton-Metropolis algorithm and the back-off strategy are taken from Zhu [1]. The dynamic back-off is original.

We cannot sample from the posterior distribution directly, as it may be highly non-linear. Instead, we can create an approximation, such as a Gaussian approximation, from which we can sample directly. For this purpose, as in the Gauss-Newton algorithm, we replace the value of the function f by its linearization. That is, if we are proposing z from x, we replace f(z) by its approximation around x: f(x) + ∇f(x)(z − x). Hence, we get the proposal distribution:

$$K(x, z) = \frac{1}{Z} \pi(z)\, e^{-\|f(x) + \nabla f(x)(z - x)\|^2/2}$$

What may not be clear is that this approximation is Gaussian and can therefore be sampled directly. Up to a factor of −1/2, the exponent of this proposal is:

$$(z - m)^T H (z - m) + \|f(x) + \nabla f(x)(z - x)\|^2$$

After some algebra, as a function of z, this may be reduced to:

$$z^T \left(H + \nabla f(x)^T \nabla f(x)\right) z - 2 z^T \left(Hm - \nabla f(x)^T f(x) + \nabla f(x)^T \nabla f(x)\, x\right) + O(1)$$

Therefore, we can complete the square to get (z − μ)^T P (z − μ) by setting:

$$P = H + \nabla f(x)^T \nabla f(x) \quad \text{and} \quad \mu = P^{-1}\left(Hm - \nabla f(x)^T f(x) + \nabla f(x)^T \nabla f(x)\, x\right)$$

Thus K(x, ·) is the Gaussian N(μ, P^{-1}) with mean μ and precision matrix P, which can be sampled directly. Then we use the Metropolis-Hastings algorithm to achieve our original goal of sampling p(x). We need an acceptance probability A that satisfies detailed balance:

$$p(x) K(x, z) A(x, z) = p(z) K(z, x) A(z, x)$$

Therefore,

$$A(x, z) = \min\left\{1,\ \frac{p(z) K(z, x)}{p(x) K(x, z)}\right\}$$

Algorithms that are not globally convergent may have trouble at points that are far from the optimum. The proposals at these points are not good approximations of the distribution being simulated. This leads to a low acceptance probability, and hence a large autocorrelation time. This situation is undesirable and needs to be overcome. The back-off strategy is a way to overcome this difficulty.

In Markov chain Monte Carlo algorithms, first a proposal is made, then it is accepted according to an acceptance recipe. If it is rejected, the next element in the chain is the same as the previous element; the chain remains rigidly in place. To overcome this, rather than rejecting a sample that is not accepted, the proposal distribution is dilated and a new proposal is made. This is motivated by the step size control methods used in line search.

We need to generalize the detailed balance condition and update the acceptance recipe accordingly to accommodate the back-off. This is achieved by "very detailed balance" and the new acceptance recipe, where the reverse path replays the rejected proposals in reverse order:

$$A_n(x, z_n \mid z_1, \ldots, z_{n-1}) = \min\left\{1,\ \frac{p(z_n)\, K_1(z_n, z_{n-1})\,[1 - A_1(z_n, z_{n-1})] \cdots K_n(z_n, x)}{p(x)\, K_1(x, z_1)\,[1 - A_1(x, z_1)] \cdots K_n(x, z_n)}\right\}$$

A better way to choose the step size is to note where we want the algorithm to sample most. This is the area with the highest probability, where p(x) is largest. We can approximately achieve this by minimizing ||f(x)||^2, since p(x) decreases monotonically as ||f(x)||^2 grows:

$$p(x) = \chi(x) \frac{1}{Z} \pi(x)\, e^{-\|f(x)\|^2/2} \propto e^{-\|f(x)\|^2/2}$$

We can minimize ||f(x)||^2 in a search direction by polynomial interpolation [2]. If (z − x) is the search direction, then we want to minimize φ(t) = ||f(x + t(z − x))||^2 over t. This gives φ′(t) = 2 f(x + t(z − x))^T ∇f(x + t(z − x)) (z − x). So we know the four values φ(0), φ(1), φ′(0), and φ′(1), which can be used to perform cubic interpolation and find the approximate minimizer. This is used as our dilation factor.
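The completion of the square above translates directly into code. The following standalone numpy sketch is an illustration of the formulas, not the package's implementation: it builds the Gauss-Newton proposal at x and draws one sample z from K(x, ·).

import numpy as np

def gn_proposal(x, f, J, m, H, rng):
    # K(x, .) is Gaussian with precision P = H + J^T J and mean
    # mu = P^{-1} (H m - J^T f + J^T J x), as derived above.
    Jx, fx = J(x), f(x)
    P = H + Jx.T @ Jx
    mu = np.linalg.solve(P, H @ m - Jx.T @ fx + Jx.T @ Jx @ x)
    # Sample z ~ N(mu, P^{-1}) via a Cholesky factor of the covariance.
    L = np.linalg.cholesky(np.linalg.inv(P))
    return mu + L @ rng.standard_normal(x.size)

The acceptance step then evaluates A(x, z) from the ratio above, with K(x, z) the density of this Gaussian at z.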
Section 4: Examples

4.1 Quickstart

The quickstart.py file in the ~/gnm/examples directory contains a simple example. Running this file should create a plot that looks like Figure 1.

$ python quickstart.py

In this example, n = m = 1, f(x) = (x^2 − y)/σ, and f′(x) = 2x/σ. The prior has mean m = 0 and precision matrix H = [1], giving π(x) = (1/Z) e^{−x^2/2}. Therefore, the probability distribution that we want to sample is:

$$p(x) = \frac{1}{Z} \pi(x)\, e^{-|f(x)|^2/2} = \frac{1}{Z}\, e^{-x^2/2 - (x^2 - y)^2/(2\sigma^2)}$$

Figure 1: Sampling results for the simple well problem.
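Written as a user function in the sense of Listing 1, the quickstart model above might look like this sketch; the values of y and σ are illustrative, and quickstart.py sets its own.

import numpy as np

y, sigma = 2.0, 0.5  # illustrative values

def model(x, args):
    chi = 1                                    # defined on all of R
    f = np.array([(x[0] ** 2 - y) / sigma])    # f(x) = (x^2 - y) / sigma
    J = np.array([[2.0 * x[0] / sigma]])       # f'(x) = 2x / sigma
    return chi, f, J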
4.2 Well

The problem in quickstart can be made harder by deepening the well, that is, by taking y to be bigger. This example is in the ~/well folder. The following line will list all the options for the file that you can play around with.

$ python well.py -h

In this example we first check the usage of Jtest, then sample a simple 2D example. There is a rotator for visualization so that we can plot any cut of the posterior probability distribution along a plane.

4.3 Exponential Time Series
Suppose we have a process that describes the relationship between time and data by g(t) = \sum_{i=1}^d w_i e^{-\lambda_i t}. We have measurements y_k = g(t_k) + ε_k for k = 1, ..., m, where the noise is independent and identically distributed with ε_k ∼ N(0, σ_k^2). Then the parameter is x = (w_1, ..., w_d, λ_1, ..., λ_d), with n = 2d. The model function is given by f_k(x) = (g(t_k, x) − y_k)/σ_k for k = 1, ..., m.

The experiment was done for n = 4 and m = 10. The sampler was run for 10^6 samples and the first 2000 were burned. The mean of the prior was taken to be [4., ., ., .] and the precision matrix was chosen as the identity multiplied by 0.5. The initial guess is the mean of the prior. The data were created by taking the input as [1., ., ., .1] and adding random normal noise with standard deviation 0. The 2D marginal posterior probability distribution, λ versus w, is shown in Figure 2. Figure 3 presents a 1D marginal histogram of the probability distribution sampled, for λ. This figure also displays the quadrature (theoretical) curve versus the sampled probability function, where we can observe the correctness of the algorithm.

Figure 2: 2D marginal posterior probability function for the exponential time series, for variables λ versus w.

Figure 3: Histogram of the sampled probability for λ (n = 1e+06, P(A) = 0.291), with the quadrature curve overlaid.

In this experiment we see significant gains from using back-off, both in the acceptance rate and in the autocorrelation time, as seen in Table 3; however, this conclusion does not hold for the dynamic back-off. Since lower autocorrelation time implies higher effective sample size, we have for instance 8850 effective samples with 1 back-off step versus 5952 effective samples with no back-off steps.
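A user function for this exponential time-series model must return the residuals f_k and the Jacobian with respect to both the weights w_i and the exponents λ_i. The sketch below is one illustrative way to write it, assuming args packs the times, measurements, and noise levels as arrays.

import numpy as np

def model(x, args):
    t, y, sigma = args              # times t_k, data y_k, sigma_k (length m)
    d = x.size // 2
    w, lam = x[:d], x[d:]
    E = np.exp(-np.outer(t, lam))   # E[k, i] = exp(-lam_i * t_k)
    g = E @ w                       # g(t_k) = sum_i w_i exp(-lam_i * t_k)
    f = (g - y) / sigma
    # df_k/dw_i = E[k, i] / sigma_k
    Jw = E / sigma[:, None]
    # df_k/dlam_i = -w_i * t_k * E[k, i] / sigma_k
    Jlam = -(w[None, :] * t[:, None] * E) / sigma[:, None]
    return 1, f, np.hstack([Jw, Jlam])   # chi = 1: defined everywhere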
In Figure 4, we can observe that the covariance of the chain is stable, implying that the chain is stationary. In Figure 5 we can see the percentage of samples accepted at each step (−1 is reject). This example shows that using too many back-off steps does not help significantly; however, adding the first back-off step increases the acceptance rate of the algorithm significantly, by around 30%.
Table 3: Comparison of sampling strategies for sampling the exponential time series with different back-off strategies.

Max Steps    Dilation Factor    Acceptance Rate    Acor Time    Effective Size    Number of Function Calls

Figure 4: Covariance function C(t) for λ, the first component of the Markov chain.

Figure 5: Percentage of samples accepted at each back-off step.

Section 5: References

[1] B. Zhu. Gauss-Newton-Metropolis with Backup Strategy. Courant Institute of Mathematical Sciences, New York University, New York, NY, 2013.
[2] J. Nocedal and S. Wright. Numerical Optimization. Springer, New York, 2000.

[3] J. Goodman and J. Weare. Ensemble samplers with affine invariance. Communications in Applied Mathematics and Computational Science, Vol. 5 (2010), No. 1, pp. 65-80.

[4] W. R. Gilks, S. Richardson and D. J. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman & Hall/CRC, 1996.