Review on Parameter Estimation in HMRF
Georgia Institute of Technology
Namjoon Suh

Abstract

This is a technical report that explores estimation methodologies for hyper-parameters in the Markov Random Field and the Gaussian Hidden Markov Random Field. In the first section, we briefly review the theoretical framework of the Metropolis-Hastings algorithm. Next, using the MH algorithm, we simulate data from the Ising model and study how hyper-parameter estimation in the Ising model is enabled through an MCMC algorithm with a pseudo-likelihood approximation. The following section deals with the parameter estimation process for the Gaussian Hidden Markov Random Field using MAP estimation and the EM algorithm, and discusses problems found through several experiments. In the final section, we extend this idea to estimating the parameters of the Gaussian Hidden Markov Spatial-Temporal Random Field and display the results of two experiments.

Review on Metropolis-Hastings algorithm
The Metropolis-Hastings algorithm is a Markov Chain Monte Carlo method for obtaining a sequence of random samples from a probability density for which direct sampling is difficult. In other words, the Metropolis-Hastings algorithm lets us construct a Markov chain which converges to the target distribution π(x). The basic idea is as follows. In order for the Markov chain {x_0, x_1, x_2, x_3, ...} to converge to the target density π(x), we should know how to construct a transition kernel whose nth (n ∈ N) iterate converges to the invariant target distribution. Let q(y|x) be a candidate-generating density (or proposal density). This density is to be interpreted as saying that, when the process is at the point x, the density generates a new candidate point y. If q(y|x) itself satisfies the detailed balance condition for all x, y ∈ S, x ≠ y (S denotes the continuous state space on which the Markov chain is defined), then our search is over. But in most cases it will not; for instance,

π(x) q(y|x) > π(y) q(x|y).   (1)

In the above case, speaking somewhat loosely, the process moves from x to y too often, and from y to x too rarely. Metropolis and Hastings introduced an "acceptance probability" allowing the Markov chain to move in the direction that restores detailed balance, so that both x and y are generated from the target distribution π(·). Introducing the acceptance probability and requiring detailed balance gives

π(x) q(y|x) α(x, y) = π(y) q(x|y) α(y, x).   (2)

Here α(x, y) denotes the probability of accepting a move from x to y. Inequality (1) tells us that moves from y to x are not made often enough, so we should define α(y, x) as large as possible; since it is a probability, its upper limit is 1. Plugging α(y, x) = 1 into (2), α(x, y) can be defined as

α(x, y) = min{ 1, π(y) q(x|y) / (π(x) q(y|x)) }.

In practice, we generate a random number from U(0, 1); if the acceptance probability α(x, y) is larger than this number, we allow the move from x to y, and otherwise the chain stays at the point x. Considering only the case x ≠ y, the transition kernel can be written as

K(x → y) = q(y|x) α(x, y),   if x ≠ y.

To complete the definition of the Metropolis-Hastings chain, we must account for the possibly non-zero probability that the process remains at x. To explain this more clearly, we redefine the transition kernel as follows. The transition kernel K(x → B) is a conditional distribution function that represents the probability of moving from x to a point in a set (or interval) B:

K(x → B) = q(B|x) α(x, B),                                     if x ∉ B,
K(x → B) = 1 - ∫_{x'∈S\B} q(x'|x) α(x, x') dx',                 if x ∈ B.

Here the candidate-generating density (or proposal density) is denoted by q(B|x).
This density says that, when the process is at the point x, it generates a candidate point in the region B, and α(x, B) is the probability of accepting the proposal that x moves into the region B. S denotes the continuous state space on which the Markov chain is defined. As can be seen from the definition of K(x → B), the transition naturally separates into two cases. First, when x does not belong to the region B, we propose a move from x to B and accept the proposal with probability α(x, B). Conversely, when x is already in B, we only need to subtract from 1 the probability that the newly proposed point lands in S\B and is accepted. The two cases can be expressed in one line as

K(x → B) = q(B|x) α(x, B) + I(x ∈ B) [ 1 - ∫_{x'∈S} q(x'|x) α(x, x') dx' ].   (3)

For simplicity of notation, write r(x) = 1 - ∫_{x'∈S} q(x'|x) α(x, x') dx'. In order for π(x) to be an invariant distribution of the transition kernel K(x → B), the following detailed balance condition should be satisfied:

π(x) K(x → B) = π(B) K(B → x).

If we integrate the left-hand side with respect to x over the whole state space S, we obtain equality (4); if we can show that this equality holds for the transition kernel defined in (3), we complete the proof that the Metropolis-Hastings kernel leaves π invariant:

∫_{x∈S} K(x → B) π(x) dx = π(B).   (4)

∫_{x∈S} K(x → B) π(x) dx
  = ∫_{x∈S} [ q(B|x) α(x, B) + I(x ∈ B) r(x) ] π(x) dx
  = ∫_{x∈S} q(B|x) α(x, B) π(x) dx + ∫_{x∈B} r(x) π(x) dx
  = ∫_{x∈S} [ ∫_{y∈B} q(y|x) α(x, y) dy ] π(x) dx + ∫_{x∈B} r(x) π(x) dx
  = ∫_{y∈B} [ ∫_{x∈S} q(x|y) α(y, x) dx ] π(y) dy + ∫_{x∈B} r(x) π(x) dx
  = ∫_{y∈B} [ 1 - r(y) ] π(y) dy + ∫_{x∈B} r(x) π(x) dx
  = ∫_{x∈B} π(x) dx = π(B).   (5)

Note that we used relation (2) in the derivation (5), proving that the M-H kernel has π(x) as its invariant density. In the Metropolis-Hastings algorithm, the calculation of α(x, y) does not require knowledge of the normalizing constant of the target density (for instance, the partition function of the Ising model). Moreover, if the candidate-generating density is symmetric, an important special case with q(x|y) = q(y|x), the probability of a move from x to y reduces to min{1, π(y)/π(x)}, which further facilitates the calculation. We will use these distinctive properties of the algorithm in the following section to simulate the behavior of the Ising model.
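To make the acceptance step concrete, the following is a minimal sketch (in Python with NumPy; not part of the original derivation) of a random-walk Metropolis-Hastings sampler for a one-dimensional target known only up to a normalizing constant. The Gaussian proposal is symmetric, so the acceptance probability reduces to min{1, π(y)/π(x)} as noted above; the example target, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

def metropolis_hastings(log_target, x0, n_iter=10000, step=1.0, rng=None):
    """Random-walk Metropolis-Hastings for a 1-D target known up to a constant.

    log_target : function returning log pi(x) up to an additive constant.
    The Gaussian proposal is symmetric, so alpha = min(1, pi(y)/pi(x)).
    """
    rng = np.random.default_rng() if rng is None else rng
    samples = np.empty(n_iter)
    x, logp_x = x0, log_target(x0)
    for t in range(n_iter):
        y = x + step * rng.standard_normal()          # propose y ~ N(x, step^2)
        logp_y = log_target(y)
        if np.log(rng.uniform()) < logp_y - logp_x:   # accept with prob min(1, pi(y)/pi(x))
            x, logp_x = y, logp_y
        samples[t] = x
    return samples

# Example: sample from a standard normal target (no normalizing constant needed).
draws = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0)
```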
Generate samples from the 2D Ising model via the Metropolis-Hastings algorithm

Consider a binary random field on an N x N regular lattice in R^2. Assume that each pixel at position (l, m) can take the value Z_lm = 1 or Z_lm = -1. This double indexing can be translated into univariate indexing by (l, m) → i = (N-1)*l + m. Let Z = {Z_i : i ∈ Ω} be the variables indicating the true segmentation of the image. We assume that Z comes from a finite-state Markov Random Field (MRF). For simplicity, we consider only the Ising model with a first-order neighborhood system and no external field:

P(Z = z | β) = (1 / C(β)) exp( Σ_{i∈Ω} z_i β Σ_{i'∈N(i)} z_{i'} ),

where C(β) denotes the partition function (or normalizing constant), which is required to create a valid probability distribution:

C(β) = Σ_{z ∈ {-1,+1}^{|Ω|}} exp( Σ_{i∈Ω} z_i β Σ_{i'∈N(i)} z_{i'} ).

Because this notorious partition function is computationally infeasible to obtain, we are unable to simulate the behavior of the Ising model with an ordinary sampling scheme. Instead, we use the Metropolis-Hastings algorithm to sample from the target density P(Z = z | β). In our model, the calculation of α(z, z') in Metropolis-Hastings does not require knowledge of the normalizing constant of P(Z = z | β), since it appears in both the numerator and the denominator of α(z, z'). Since each pair of neighboring pixels corresponds to one edge of the N x N regular grid, there are N-1 vertical edges in each of the N columns, i.e. N(N-1) vertical edges. By symmetry the number of horizontal edges is the same, so there are exactly 2N(N-1) pairs. The expression Σ_{i∈Ω} z_i (β Σ_{i'∈N(i)} z_{i'}) can be written in terms of 2N(N-1) - 2ν_z, where ν_z is the number of disagreeing edges (bordering pixels with values -1 and +1 in z), so that

P(Z = z | β) ∝ exp(-β ν_z).

With this simplification we can easily implement the MH algorithm for our model. Start with the space of all configurations C, in which each configuration Z is represented as a vector

Z = (z_1, z_2, ..., z_i, z_{i+1}, ..., z_n).

The MH algorithm consists of the following steps:
1. Start with a random configuration Z.
2. Select a random pixel z_i.
3. Propose a new configuration Z' = (z_1, z_2, ..., -z_i, z_{i+1}, ..., z_n).
4. Calculate the acceptance probability α(z, z') = min{ 1, [P(Z = z' | β) g(z | z')] / [P(Z = z | β) g(z' | z)] }, where
   g(z' | z) = 0 if z and z' differ in more than one coordinate, and g(z' | z) is a constant if z and z' differ in exactly one coordinate (so the proposal is symmetric).
5. Generate a random number u ~ U(0, 1).
   Accept the newly proposed z' if u ≤ α(z, z'); otherwise stay at the original z.
6. Iterate Steps 2-5 until the samples converge to the stationary state.

In Step 4, the acceptance rate can be computed easily. Since the proposal density is symmetric, it cancels from the numerator and denominator. Let ν_{z_i} denote the number of disagreeing edges between z_i and its neighborhood. If we flip the value of z_i to z_i' = -z_i, then ν_{z_i'} equals the number of agreeing edges between z_i and its neighboring nodes before the flip. Since P(Z = z | β) ∝ exp(-β ν_z), the ratio P(Z = z' | β) / P(Z = z | β) equals exp(-β (ν_{z_i'} - ν_{z_i})).

The following pictures show simulation results for the Ising model on a 125 x 125 lattice using the Metropolis-Hastings algorithm with varying values of beta (beta = -1, 0.3, 0.85). We also illustrate the number of iterations needed for each simulation to converge. From left to right and below, beta = -1, 0.3, 0.85, respectively.
Figure 1. Ising model with beta = -1 (Left above), 0.3 (Right above), 0.85 (Below)
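As an illustration of the single-flip update described above, here is a minimal sketch of the sampler, written under the convention P(Z = z | β) ∝ exp(-β ν_z) used in the text, with free (non-periodic) boundaries; the lattice size and sweep count are illustrative assumptions and are much smaller than the 125 x 125 runs reported here.

```python
import numpy as np

def ising_mh(N=64, beta=0.85, n_sweeps=500, rng=None):
    """Single-flip Metropolis-Hastings for P(z|beta) ∝ exp(-beta * nu_z),
    where nu_z is the number of disagreeing edges (convention used in the text)."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.choice([-1, 1], size=(N, N))
    for _ in range(n_sweeps * N * N):
        i, j = rng.integers(N), rng.integers(N)
        # Sum of the first-order neighbours (free boundary).
        s = 0
        if i > 0:     s += z[i - 1, j]
        if i < N - 1: s += z[i + 1, j]
        if j > 0:     s += z[i, j - 1]
        if j < N - 1: s += z[i, j + 1]
        # Disagreeing edges at (i, j) before and after the proposed flip.
        deg = (i > 0) + (i < N - 1) + (j > 0) + (j < N - 1)
        nu_old = (deg - z[i, j] * s) // 2
        nu_new = deg - nu_old
        if rng.uniform() < np.exp(-beta * (nu_new - nu_old)):  # alpha(z, z')
            z[i, j] = -z[i, j]
    return z
```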
Here we can observe some interesting facts about the relationship between the value of beta and the simulation results. As seen above, when beta is -1, which means that each pixel is highly likely to have a different sign from its neighboring nodes, the figure looks almost like alternating -1 and +1 at nearly every pixel of the grid. In contrast, when beta is greater than 0 and increases (beta equal to 0.3 and 0.85), each pixel is likely to share the value of its neighbors, and we can clearly see larger white regions and larger black regions in the grid as beta increases. In the following section, we explore a method for estimating the value of beta from a given observation Z.

Statistical Inference on Beta in the 2D Ising Model (MRF) using an MCMC Algorithm
The most fundamental problem in our estimation process stems from the normalizing constant of the model. We can naturally think of using an MCMC algorithm to estimate the parameter β. However, given the observation Z, estimating beta requires knowledge of the normalizing constant C(β) when computing the acceptance ratio α(β, β') (i.e. C(β') and C(β) do not cancel out). Lei et al. (1999) introduced an elegant way to solve this problem by applying the pseudo-likelihood approximation (Besag, 1975) to the original likelihood function.

Figure 2. Pseudo-likelihood in a 2 x 3 lattice
The PL approach is illustrated in Figure 2 for a 2D grid. We learn to predict each node given all of its neighbors. This objective is generally fast to compute, since each full conditional density P(Z_i | z_{-i}, β) only requires summing over the states of a single node Z_i in order to compute the local normalization constant. The PL approximation to our original likelihood function is specified as

P(Z = z | β) ≈ Π_{i∈Ω} P(Z_i = z_i | z_{-i}, β) = Π_{i∈Ω} P(Z_i = z_i | z_{N(i)}, β)
            = Π_{i∈Ω} exp( z_i β Σ_{i'∈N(i)} z_{i'} ) / Σ_{z_i∈{-1,+1}} exp( z_i β Σ_{i'∈N(i)} z_{i'} ).

Since our primary purpose is to estimate the parameter given the observation z̃, the likelihood function can be written as

P(β | Z = z̃) ∝ PL(β | Z = z̃) = Π_{i∈Ω} exp( z̃_i β Σ_{i'∈N(i)} z̃_{i'} ) / Σ_{z_i∈{-1,+1}} exp( z_i β Σ_{i'∈N(i)} z̃_{i'} ).

Note that, in the process of calculating the denominator of each
full conditional P(Z_i = z_i | z_{-i}, β), provided that the values of the neighbors are known, the only thing we need to do is sum over the states of the single node Z_i. We can now sample a Markov chain of β from the approximated density PL(β | Z = z̃) using the M-H algorithm. At each time t, the next state β_{t+1} is sampled from the proposal distribution g(β | β_t). The generated sequence of dependent random variables {β_0, β_1, β_2, β_3, ...} gradually converges to the stationary distribution π(β) after a sufficiently large number of iterations n. If we allow a burn-in period of, say, m iterations, then by the law of large numbers we can estimate the true parameter β by

β̂ = E(β | Z = z̃) ≈ (1 / (n - m)) Σ_{i=m+1}^{n} β_i.

This equation shows how the Markov chain of β can be used to estimate E(β | Z = z̃). Such a Markov chain can be constructed by the Metropolis-Hastings algorithm (Gilks et al., 1996). Let the proposal distribution g(β' | β_t) = g(β_t | β') be a normal density centered on the current value β_t. Then the acceptance probability α(β_t, β') can be calculated as
α(β_t, β') = min{ 1, exp( log PL(β' | Z = z̃) - log PL(β_t | Z = z̃) ) }
           = min( 1, exp( Σ_{i∈Ω} z̃_i β' Σ_{i'∈N(i)} z̃_{i'} - Σ_{i∈Ω} z̃_i β_t Σ_{i'∈N(i)} z̃_{i'}
                          - Σ_{i∈Ω} log Σ_{z_i∈{-1,+1}} exp( z_i β' Σ_{i'∈N(i)} z̃_{i'} )
                          + Σ_{i∈Ω} log Σ_{z_i∈{-1,+1}} exp( z_i β_t Σ_{i'∈N(i)} z̃_{i'} ) ) ).

If the acceptance ratio α(β_t, β') is greater than or equal to a randomly generated number u ~ Unif[0, 1], we accept the newly proposed β' as β_{t+1}; otherwise β_{t+1} stays at the previous value β_t. The Metropolis-Hastings algorithm for estimating β can be summarized in the following procedure.

Parameter Estimation for MRF using the Metropolis-Hastings Algorithm
1. Initialize β_0; set t = 0 and T = maximum number of iterations.
2. While t < T:
   (a) Sample β' ~ N(β_t, 1).
   (b) Generate a uniform random number U ~ Unif(0, 1).
   (c) If U < α(β_t, β'), set β_{t+1} = β'; otherwise set β_{t+1} = β_t.
   (d) Set t ← t + 1.
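Below is a minimal sketch of one pseudo-likelihood Metropolis-Hastings update for β, assuming a fixed observed configuration z (values ±1) on a square lattice with free boundaries; the helper names and the proposal standard deviation are illustrative assumptions. Iterating the update and averaging the draws after burn-in gives the estimator β̂ = E(β | Z = z̃) described above.

```python
import numpy as np

def neighbour_sums(z):
    """For each pixel, the sum of its first-order neighbours (free boundary)."""
    s = np.zeros_like(z, dtype=float)
    s[1:, :] += z[:-1, :]; s[:-1, :] += z[1:, :]
    s[:, 1:] += z[:, :-1]; s[:, :-1] += z[:, 1:]
    return s

def log_pseudo_likelihood(beta, z):
    """log PL(beta | z) = sum_i [ z_i*beta*S_i - log(exp(beta*S_i) + exp(-beta*S_i)) ]."""
    S = neighbour_sums(z)
    return np.sum(z * beta * S - np.logaddexp(beta * S, -beta * S))

def mh_update_beta(beta_t, z, rng, step=1.0):
    """One Metropolis-Hastings step for beta with a N(beta_t, step^2) proposal."""
    beta_prop = beta_t + step * rng.standard_normal()
    log_alpha = log_pseudo_likelihood(beta_prop, z) - log_pseudo_likelihood(beta_t, z)
    return beta_prop if np.log(rng.uniform()) < log_alpha else beta_t
```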
We performed several experiments to validate this methodology. On a 125 x 125 lattice, we simulated the Ising model with five different beta parameters: 1, 0.85, 0.3, -0.25, -1. The number of iterations needed for the simulated data to converge was around 2,000,000 for all five parameters. Since there is no definitive convergence criterion, we checked by eye whether the behavior of the Ising model kept changing as the iteration count grew. In the following graphs, we can see that the Markov chain of β converges within fewer than 300 iterations when beta is equal to -1 (left) and 1 (right), respectively.

Figure 3. Monte Carlo output of the Markov chain {β_0, β_1, β_2, β_3, ...} when β = -1 (left) and β = 1 (right)

In order to obtain acceptable parameter estimates, the MCMC procedure described above should be repeated until the Markov chain reaches stability. The choice of starting point β_0 does not affect the stationary distribution as long as the chain is irreducible. We therefore started the algorithm from three different starting points: -2, 4, 8. The Markov chains converge in fewer than 300 iterations. The usual informal approach to detecting convergence is visual inspection of plots of the Monte Carlo output {β_0, β_1, β_2, β_3, ...} (see Figure 3). Here we set the burn-in period to m = 500 iterations out of n = 1000 total iterations. Table 1 displays the estimation results for the three different starting points under five different parameter settings.
Table 1. Estimation results for β in the Markov Random Field using the pseudo-likelihood approximation

Initial point    Beta = -1    Beta = -0.25    Beta = 0.3    Beta = 0.85    Beta = 1
-2               -0.9837      -0.2478         0.2952        0.7987         1.0096
 4               -0.9784      -0.2473         0.3070        0.8241         1.0025
 8               -0.9816      -0.2340         0.3044        0.8230         1.0196

Parameter estimation and image segmentation in HMRF with spatial data
Sometimes images are temporarily distorted for various technical reasons, and the true image (Z) is hidden under noisy data (Y). Here we study a methodology for estimating both the true image (Z) and the parameters (θ) of the model at the same time. We use the term "segmentation": segmentation is a process which divides an image into several homogeneous regions with smooth boundaries. For smooth boundaries we use the HMRF, which models spatial coherence between classes with an MRF. For simplicity, we use the same Ising model as defined in Section 1:

P(Z = z | β) = (1 / C(β)) exp( Σ_{i∈Ω} z_i β Σ_{i'∈N(i)} z_{i'} ),   (6)

where C(β) is the normalizing constant, called the partition function, and each pixel at position i takes the value Z_i = 1 or Z_i = -1. Given Z, the components of Y are conditionally independent with distribution

g(Y | Z = z, μ1, μ-1, σ²) = Π_{i∈Ω} g(Y_i | Z_i = z_i, μ1, μ-1, σ²).

Here we assume g(· | Z_i = z_i, μ1, μ-1, σ²) to be the normal distribution N(μ, σ²) with μ = μ1 if Z_i = 1 and μ = μ-1 if Z_i = -1. The neighborhood of an interior point i in Ω consists of the four sites nearest to i in the 2D lattice Ω. We simulated data from the above model with β = 0.2, μ1 = 2, μ-1 = 0, σ² = 1 on a 125 x 125 lattice. The following figure plots a simulated dataset consisting of {Z_i | i ∈ Ω} and {Y_i | i ∈ Ω}.

Figure 4. Simulated Z (left) and Y (right) on a 125 x 125 lattice
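As a sketch of how such a dataset can be generated (assuming a hidden field z has already been drawn from the Ising prior, for instance with the Metropolis sampler sketched earlier), the emission step simply draws each Y_i from the normal distribution selected by z_i; the parameter values below follow the simulation setting in the text.

```python
import numpy as np

def simulate_y_given_z(z, mu1=2.0, mu_m1=0.0, sigma2=1.0, rng=None):
    """Given a hidden configuration z in {-1,+1}, draw Y_i ~ N(mu_{z_i}, sigma^2)
    independently for each pixel (the Gaussian emission model of the HMRF)."""
    rng = np.random.default_rng() if rng is None else rng
    mu = np.where(z == 1, mu1, mu_m1)
    return mu + np.sqrt(sigma2) * rng.standard_normal(z.shape)

# Usage (hypothetical): z = ising_mh(N=125, beta=0.2); y = simulate_y_given_z(z)
```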
The spatial dependency is clearly visible in {Z_i | i ∈ Ω}, but not in {Y_i | i ∈ Ω}. With this simulated Y, we now investigate a methodology for restoring the original Z and the parameter values of the model.

Overview of the estimating procedure for the problem
We first need to see what the likelihood function of our model looks like. The joint density function of Y = {Y_i | i ∈ Ω} and Z = {Z_i | i ∈ Ω} is as follows. For simplicity of notation, we denote θ = (ψ, φ), where φ = β and ψ = (μ1, μ-1, σ²):

P_θ(Y_1, ..., Y_|Ω|, Z_1, ..., Z_|Ω|) = P_φ(Z_1, ..., Z_|Ω|) Π_{i∈Ω} P_ψ(Y_i | Z_i).

Our primary purpose is to maximize this likelihood function with respect to Z and θ. Since our model involves the latent variable Z, it is natural to use the EM algorithm to obtain the MLE. Each iteration consists of an expectation step (E-step) and a maximization step (M-step). The E-step uses the current estimate θ^(old) to compute the conditional expectation of the complete-data log-likelihood:

Q(θ | θ^(old)) = E_{θ^(old)}[ log P_θ(Y, Z) | Y ].

The M-step is carried out by taking partial derivatives of Q(θ | θ^(old)) with respect to each component of θ and solving the estimating equation Q̇(θ | θ^(old)) = 0 to obtain the updated estimate θ^(new). To compute the expected likelihood Q(θ | θ^(old)) in the E-step, we also need to estimate, or restore, the most likely state of Z given Y and θ^(old); in other words, we use the maximum a posteriori (MAP) estimate of the true segmentation of the image. The solution to our problem can be summarized as iterating the following two steps (MAP estimation and EM).

(Step 1) Given the estimate θ^(old), find the maximum a posteriori (MAP) estimate of Z:

Z̃_MAP = argmax_z P(z | Y, θ^(old)) = argmax_z P_{φ^(old)}(z_1, ..., z_|Ω|) Π_{i∈Ω} P_{ψ^(old)}(Y_i | z_i).

(Step 2) Using the estimate Z̃_MAP, approximate the expected log-likelihood E_{θ^(old)}[ log P_θ(Y, Z) | Y ] and update the estimate of θ to θ^(new) (EM).

MAP estimation of Z (unobserved data) using the Block Gibbs Sampler
Finding the MAP estimate in Step 1 can be computationally very challenging, and several Monte Carlo procedures have been proposed in the statistical literature. As pointed out in the paper by Lim et al. (2007), if the value of the parameter θ is known, the simplest procedure would be the iterated conditional modes (ICM) algorithm; but since the likelihood of the joint density is not convex, ICM is highly likely to get trapped at a local mode. Simulated annealing has been suggested to resolve this local-mode difficulty by choosing an appropriate tempering schedule. The most naive procedure is, of course, based on samples from the posterior distribution P_θ(Z | Y). Here, as in Section 1, we could use the Metropolis-Hastings algorithm to estimate Z given the parameter θ, but this single-site update requires a long time for the chain to converge to the stationary state, usually more than 1,000,000 iterations. This sampling procedure plays a pivotal role in determining the complexity of the algorithm. Furthermore, the sampling procedure is used repeatedly when estimating β with the newly updated β at each EM iteration; if these steps take a long time, the total running time of the algorithm increases drastically. The Block Gibbs Sampler provides a powerful tool for estimating the posterior distribution of the unobserved data (Z) given the observed data (Y) in an HMRF context. As pointed out by Liu (2001, pp. 130-131), the Gibbs Sampler is a Markov chain that converges geometrically to its stationary distribution, which is the posterior distribution of interest. The convergence rate of Z depends on how strongly the components Z_i are correlated with each other. This fact led Liu et al. (1994) to improve the efficiency of the Gibbs sampler by grouping highly correlated components and sampling the blocks iteratively from their joint conditional distributions, giving the Block Gibbs Sampler.

Figure 5. Neighborhood system of a 2 x 2 regular block (left) and how each block is updated without overlapping its neighboring blocks (right)
The key here is to define the block and its joint conditional distribution. A block can have any form; we used the simplest 2 x 2 regular fixed block. An inner block's neighborhood can be set as the 8 points connected by edges to the elements of the block. Since the block Gibbs scheme is used not only for state estimation but also for parameter estimation, we split its usage into two cases: a Block Gibbs Sampler for calculating E_{θ^(old)}[ Σ_{i∈Ω} Z_i (Σ_{i'∈N(i)} Z_{i'}) | Y ] and one for calculating E_β[ Σ_{i∈Ω} Z_i (Σ_{i'∈N(i)} Z_{i'}) ]. The one and only difference between the two procedures is how the joint conditional distribution of each block is defined. First, let us investigate the detailed procedure of the Block Gibbs Sampler for the posterior distribution.
Block Gibbs Sampler for posterior distribution
Let us denote the indices of the four nodes in the ith block B_i by Z_{B_i} = {Z_{i1}, Z_{i2}, Z_{i3}, Z_{i4}}. Using the conditional independence property of the undirected graphical model, the full conditional distribution of Z_{B_i} can be written as

P_{θ^(old)}(Z_{B_i} | z̃_{\B_i}, Y_{B_i}) = P_{θ^(old)}(Z_{B_i} | z̃_{N(B_i)}, Y_{B_i})
  = q_{θ^(old)}(Z_{B_i}, z̃_{N(B_i)}, Y_{B_i}) / Σ_{Z_{B_i}∈{-1,+1}^4} q_{θ^(old)}(Z_{B_i}, z̃_{N(B_i)}, Y_{B_i}),

where N(B_i) is the set of indices of nodes surrounding the 2 x 2 regular block. The term in the numerator can be expressed as

q_{θ^(old)}(Z_{B_i}, z̃_{N(B_i)}, Y_{B_i})
  = exp{ - Σ_k (Y_{ik} - μ^(old)_{Z_{ik}})² / (2 σ²^(old)) + β^(old) ( Σ_{(k,l)∈E_i} Z_{ik} Z_{il} + Σ_k Z_{ik} Σ_{j∈L_ik} z̃_j ) },

where E_i denotes the set of edges within B_i and L_ik = N(B_i) ∩ N(Z_{ik}). The first term accounts for the conditional dependence between Z_{B_i} and its corresponding observations Y_{B_i}. The second term takes into account both the interactions among nodes within the block and the interactions between the block's nodes and their neighboring nodes outside the block. There are 16 possible combinations of Z_{B_i}, since each of its elements can take 2 values. We compute the CDF of this density P_{θ^(old)}(Z_{B_i} | z̃_{\B_i}, Y_{B_i}) over the 16 cases and pick the kth case if the CDF value corresponding to the kth case is the first to exceed a randomly generated number u ~ Unif(0, 1). As Figure 5 indicates, the next block is then updated in such a way that no elements overlap between neighboring blocks. The same update is performed until all blocks in the lattice have been updated; we count this as one iteration of the Block Gibbs Sampler for the posterior distribution. We performed 10,000 iterations and treated the first 5,000 iterations as the burn-in period. With the remaining 5,000 lattice samples, we can calculate E_{θ^(old)}[ Σ_{i∈Ω} Z_i (Σ_{i'∈N(i)} Z_{i'}) | Y ] and E_{θ^(old)}[ I(Z_i = 1) | Y ] through Monte Carlo approximation.
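The following is a minimal sketch of a single 2 x 2 block update for the posterior block Gibbs sampler just described: it enumerates the 16 candidate configurations of the block, scores each with the Gaussian data term plus the β interaction terms (within the block and with the fixed neighbours outside it), and samples one configuration from the normalized conditional. Boundary handling and the neighbourhood bookkeeping are simplified assumptions for illustration, and the caller is assumed to pass a top-left corner (i0, j0) with the full block inside the lattice.

```python
import numpy as np
from itertools import product

def block_gibbs_update(z, y, i0, j0, beta, mu1, mu_m1, sigma2, rng):
    """One update of the 2x2 block with top-left corner (i0, j0), sampled from its
    full conditional given the data y and the current values outside the block."""
    N = z.shape[0]
    idx = [(i0, j0), (i0, j0 + 1), (i0 + 1, j0), (i0 + 1, j0 + 1)]
    inner_edges = [(0, 1), (0, 2), (1, 3), (2, 3)]      # edges inside the block
    configs = list(product([-1, 1], repeat=4))           # 16 candidate configurations
    log_w = np.empty(16)
    for k, cfg in enumerate(configs):
        lw = 0.0
        for a, (i, j) in enumerate(idx):
            zi = cfg[a]
            mu = mu1 if zi == 1 else mu_m1
            lw += -(y[i, j] - mu) ** 2 / (2.0 * sigma2)   # Gaussian data term
            for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
                ni, nj = i + di, j + dj                   # neighbours outside the block
                if 0 <= ni < N and 0 <= nj < N and (ni, nj) not in idx:
                    lw += beta * zi * z[ni, nj]
        for a, b in inner_edges:                          # interactions within the block
            lw += beta * cfg[a] * cfg[b]
        log_w[k] = lw
    w = np.exp(log_w - log_w.max())
    k = rng.choice(16, p=w / w.sum())                     # sample a configuration
    for a, (i, j) in enumerate(idx):
        z[i, j] = configs[k][a]
```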
Block Gibbs Sampler for Gibbs distribution

In the process of estimating the hyper-parameter β, which is introduced in the following section, we used the Newton-Raphson method. Since the method requires the gradient and Hessian of the likelihood function, we need to calculate both E_β[ Σ_{i∈Ω} Z_i (Σ_{i'∈N(i)} Z_{i'}) ] and var_β[ Σ_{i∈Ω} Z_i (Σ_{i'∈N(i)} Z_{i'}) ] for every newly updated beta. We facilitated these calculations with a Block Gibbs Sampler. The only difference from the posterior Block Gibbs Sampler is the form of the full conditional distribution, in the sense that we now draw samples from the Gibbs distribution (6) rather than from the posterior distribution. We define the full conditional distribution of the Gibbs distribution (6) as

P_β(Z_{B_i} | z̃_{\B_i}) = P_β(Z_{B_i} | z̃_{N(B_i)}) = q_β(Z_{B_i}, z̃_{N(B_i)}) / Σ_{Z_{B_i}∈{-1,+1}^4} q_β(Z_{B_i}, z̃_{N(B_i)}).

Since we do not have to consider the conditional dependence between the observed and hidden data, the term in the numerator is simply

q_β(Z_{B_i}, z̃_{N(B_i)}) = exp{ β^(old) ( Σ_{(k,l)∈E_i} Z_{ik} Z_{il} + Σ_k Z_{ik} Σ_{j∈L_ik} z̃_j ) },

where E_i denotes the set of edges within B_i and L_ik = N(B_i) ∩ N(Z_{ik}). The remaining sampling scheme is exactly the same as for the posterior distribution. After obtaining enough lattice samples, we calculate E_β[ Σ_{i∈Ω} Z_i (Σ_{i'∈N(i)} Z_{i'}) ] and var_β[ Σ_{i∈Ω} Z_i (Σ_{i'∈N(i)} Z_{i'}) ] through Monte Carlo approximation. Having finished the MAP approximation (Step 1), we now elaborate how these approximated expectations are used in the EM algorithm (Step 2) to solve our parameter estimation problem.

Estimating parameters of the Hidden Markov Random Field with spatial data
The conditional expectation of the complete-data log-likelihood Q(θ | θ^(old)) can be written as

Q(θ | θ^(old)) = E_{θ^(old)}[ Σ_{i∈Ω} Z_i (β Σ_{i'∈N(i)} Z_{i'}) - log C(β) | Y ]
              + E_{θ^(old)}[ Σ_{i∈Ω} I(Z_i = 1) log f_ψ(Y_i | Z_i = 1) + I(Z_i = -1) log f_ψ(Y_i | Z_i = -1) | Y ]
              = ℓ_1(φ) + ℓ_2(ψ),

where φ = β and ψ = (μ1, μ-1, σ²). The estimation of ψ and φ is separable, in the sense that ∂²Q(θ | θ^(old)) / ∂ψ ∂φ = 0. The conditional expectations of the indicator functions are

E_{θ^(old)}[ I(Z_i = 1) | Y ] = P_{θ^(old)}(Z_i = 1 | Y_i),
E_{θ^(old)}[ I(Z_i = -1) | Y ] = P_{θ^(old)}(Z_i = -1 | Y_i).

These values are calculated by the Block Gibbs Sampler for the posterior distribution and used in estimating the parameters ψ. After taking expectations, the likelihood functions ℓ_1(φ) and ℓ_2(ψ) become

ℓ_1(φ) = β E_{θ^(old)}[ Σ_{i∈Ω} Z_i (Σ_{i'∈N(i)} Z_{i'}) | Y ] - log C(β),
ℓ_2(ψ) = Σ_{i∈Ω} [ P_{θ^(old)}(Z_i = 1 | Y_i) log f_ψ(Y_i | Z_i = 1) + P_{θ^(old)}(Z_i = -1 | Y_i) log f_ψ(Y_i | Z_i = -1) ].

Estimation of μ1 and μ-1. This problem is equivalent to the inference problem for the simplest Gaussian Mixture Model (GMM) with two normal components. The estimate of μ1 solves

Σ_{i∈Ω} [ P_{θ^(old)}(Z_i = 1 | Y_i) ∂ log f_ψ(Y_i | Z_i = 1)/∂μ1 + P_{θ^(old)}(Z_i = -1 | Y_i) ∂ log f_ψ(Y_i | Z_i = -1)/∂μ1 ] = 0,

which gives

μ1 = Σ_{i∈Ω} P_{θ^(old)}(Z_i = 1 | Y_i) Y_i / Σ_{i∈Ω} P_{θ^(old)}(Z_i = 1 | Y_i).

Likewise, for μ-1,

μ-1 = Σ_{i∈Ω} P_{θ^(old)}(Z_i = -1 | Y_i) Y_i / Σ_{i∈Ω} P_{θ^(old)}(Z_i = -1 | Y_i).

Estimation of σ². The estimate of σ² solves ∂Q(θ | θ^(old))/∂σ = 0, which is equivalent to

Σ_{i∈Ω} [ P_{θ^(old)}(Z_i = 1 | Y_i) ∂ log f_ψ(Y_i | Z_i = 1)/∂σ + P_{θ^(old)}(Z_i = -1 | Y_i) ∂ log f_ψ(Y_i | Z_i = -1)/∂σ ] = 0.

After some tedious calculation, the solution can be written neatly in terms of σ² as

σ² = (1/|Ω|) Σ_{i∈Ω} { P_{θ^(old)}(Z_i = 1 | Y_i) (Y_i - μ1^(new))² + P_{θ^(old)}(Z_i = -1 | Y_i) (Y_i - μ-1^(new))² }.
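A sketch of these closed-form M-step updates, assuming the posterior marginals p1[i] ≈ P_{θ^(old)}(Z_i = 1 | Y) have already been approximated from the block Gibbs samples; as in the usual two-component Gaussian-mixture M-step, the variance update plugs in the freshly updated means.

```python
import numpy as np

def m_step_gaussian(y, p1):
    """Closed-form M-step updates for (mu1, mu_-1, sigma^2) given the posterior
    marginals p1 = P(Z_i = 1 | Y) estimated by the block Gibbs sampler."""
    p0 = 1.0 - p1                                     # P(Z_i = -1 | Y)
    mu1 = np.sum(p1 * y) / np.sum(p1)
    mu_m1 = np.sum(p0 * y) / np.sum(p0)
    sigma2 = np.mean(p1 * (y - mu1) ** 2 + p0 * (y - mu_m1) ** 2)
    return mu1, mu_m1, sigma2
```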
Estimation of the hyper-parameter β. The estimate of β is the solution to ℓ̇_1(φ) = 0, i.e.

∂/∂β [ β E_{θ^(old)}[ Σ_{i∈Ω} Z_i (Σ_{i'∈N(i)} Z_{i'}) | Y ] - log C(β) ] = 0,

which can be rewritten as

E_{θ^(old)}[ Σ_{i∈Ω} Z_i (Σ_{i'∈N(i)} Z_{i'}) | Y ] = C'(β) / C(β),

where C(β) = Σ_{z∈{-1,+1}^{|Ω|}} exp( Σ_{i∈Ω} z_i β Σ_{i'∈N(i)} z_{i'} ) and Z = {Z_i | i ∈ Ω}. Here

C'(β)/C(β) = Σ_z [ Σ_{i∈Ω} z_i (Σ_{i'∈N(i)} z_{i'}) ] exp( Σ_{i∈Ω} z_i β Σ_{i'∈N(i)} z_{i'} ) / Σ_z exp( Σ_{i∈Ω} z_i β Σ_{i'∈N(i)} z_{i'} )
            = E_β[ Σ_{i∈Ω} Z_i (Σ_{i'∈N(i)} Z_{i'}) ],

so the term C'(β)/C(β) is an unconditional expectation with respect to the Gibbs distribution (6). The MLE of β is therefore the solution of

E_{θ^(old)}[ Σ_{i∈Ω} Z_i (Σ_{i'∈N(i)} Z_{i'}) | Y ] = E_β[ Σ_{i∈Ω} Z_i (Σ_{i'∈N(i)} Z_{i'}) ].

The left-hand side can be approximated by Monte Carlo sampling from the posterior distribution P_{θ^(old)}(Z = z | Y). The right-hand side can be approximated by Monte Carlo sampling from the Gibbs distribution (6). The left-hand side is calculated once with respect to θ^(old) at every EM iteration and is treated as a constant during the updating procedure. Since the equation has no closed-form solution, we must use a recursive algorithm such as Newton-Raphson or bisection. We chose the NR method over bisection because it applies not only to the spatial HMRF model but also to the spatial-temporal HMRF model. The parameter updating scheme is simple:

β^(t) = β^(t-1) - ℓ_1'(β^(t-1)) / ℓ_1''(β^(t-1)).

Taking the second derivative of the likelihood function ℓ_1(φ) with respect to β gives

ℓ_1''(β) = ∂²/∂β² { - log C(β) } = - [ C''(β)/C(β) - (C'(β)/C(β))² ].

We have already seen that C'(β)/C(β) = E_β[ Σ_{i∈Ω} Z_i (Σ_{i'∈N(i)} Z_{i'}) ]. The first term is

C''(β)/C(β) = Σ_z [ Σ_{i∈Ω} z_i (Σ_{i'∈N(i)} z_{i'}) ]² exp( Σ_{i∈Ω} z_i β Σ_{i'∈N(i)} z_{i'} ) / Σ_z exp( Σ_{i∈Ω} z_i β Σ_{i'∈N(i)} z_{i'} )
             = E_β[ ( Σ_{i∈Ω} Z_i (Σ_{i'∈N(i)} Z_{i'}) )² ],

so the second derivative of the likelihood with respect to β is

ℓ_1''(β) = - var_β[ Σ_{i∈Ω} Z_i (Σ_{i'∈N(i)} Z_{i'}) ] ≤ 0.

Since ℓ_1''(β) ≤ 0, ℓ_1(φ) is a concave function and a global optimal structure is guaranteed in our problem. We update beta through

β^(t) = β^(t-1) + ( E_{θ^(old)}[ Σ_{i∈Ω} Z_i (Σ_{i'∈N(i)} Z_{i'}) | Y ] - E_{β^(t-1)}[ Σ_{i∈Ω} Z_i (Σ_{i'∈N(i)} Z_{i'}) ] ) / var_{β^(t-1)}[ Σ_{i∈Ω} Z_i (Σ_{i'∈N(i)} Z_{i'}) ],

and terminate the algorithm when |β^(t) - β^(t-1)| < 10^-3. With a good initial point, one or two NR steps may suffice. The NR method for estimating β in the Hidden Markov Random Field is summarized below; note that the remaining parameters, updated earlier in the current EM iteration, are already given.

Newton-Raphson Method for estimating the hyper-parameter β
Given E_{θ^(old)}[ Σ_{i∈Ω} Z_i (Σ_{i'∈N(i)} Z_{i'}) | Y ]:
1. Calculate E_{β^(old)}[ Σ_{i∈Ω} Z_i (Σ_{i'∈N(i)} Z_{i'}) ] and var_{β^(old)}[ Σ_{i∈Ω} Z_i (Σ_{i'∈N(i)} Z_{i'}) ] with the Block Gibbs Sampler for the Gibbs distribution.
2. Update β^(new) = β^(old) - ℓ_1'(β^(old)) / ℓ_1''(β^(old)).
3. Repeat Steps 1-2 until |β^(new) - β^(old)| < 10^-3, then use β^(new) as the updated estimate.
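A sketch of this Newton-Raphson iteration for β, assuming two Monte Carlo routines are available from the block Gibbs samplers: the posterior expectation of T(Z) = Σ_{i∈Ω} Z_i Σ_{i'∈N(i)} Z_{i'} (computed once per EM iteration and passed in as a constant), and a routine prior_moments(β) returning the mean and variance of T(Z) under the Gibbs distribution at the current β. Both names are placeholders, not part of the original report.

```python
def newton_raphson_beta(beta, target, prior_moments, tol=1e-3, max_iter=50):
    """Newton-Raphson for the estimating equation
        E_{theta_old}[ T(Z) | Y ]  =  E_beta[ T(Z) ],
    where T(Z) = sum_i Z_i * sum_{i' in N(i)} Z_i'.

    target        : posterior expectation E_{theta_old}[T(Z)|Y] (a constant here).
    prior_moments : function beta -> (E_beta[T(Z)], Var_beta[T(Z)]), e.g. estimated
                    from block Gibbs samples of the Gibbs distribution at beta.
    """
    for _ in range(max_iter):
        mean_T, var_T = prior_moments(beta)
        step = (target - mean_T) / var_T      # -l1'(beta) / l1''(beta)
        beta += step
        if abs(step) < tol:
            break
    return beta
```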
Estimation Result & Discussion

Table 2 gives the MLEs of θ = {μ1, μ-1, σ², β} for single simulated datasets on Ω = {1, ..., 48} x {1, ..., 48}. To speed up the computation, we reduced the lattice size from 125 x 125 to 48 x 48. The convergence criterion for the EM algorithm in Section 3.3 is an upper bound of 10^-3 on the size of each parameter increment. We performed several estimations with different parameter settings; the initial values of the parameters were set arbitrarily, and five estimations were executed with true beta in [-1, 0] and another five with true beta in [0, 1]. The table reports the estimates of each parameter. Each quadruple in parentheses denotes the true parameters (μ1, μ-1, σ², β) in that order. The table also gives the CPU time (sec) on a laptop PC (Intel Core i7-5500 CPU, 2.4 GHz). We found several interesting phenomena in these results.

Table 2. Results of parameter estimations in HMRF with different parameter settings
48 x 48 lattice; true parameters (mu1, mu-1, sigma^2, beta):
                 (2, 1, 0.4, -0.2)   (2, 1, 0.4, -0.5)   (1.7, -1.5, 1.2, -0.7)   (0.7, -0.5, 1.6, -0.9)   (2, 1, 0.4, -1)
mu1
mu-1
sigma^2
beta             -0.2144             -0.4779             -0.5811                  -0.5305                  -0.9789
CPU time (sec)

48 x 48 lattice; true parameters (mu1, mu-1, sigma^2, beta):
                 (2, 0, 1, 0.2)      (1.5, -1, 1, 0.5)   (2, 1, 0.4, 0.6)         (2.3, -2, 1.5, 0.7)      (2.3, -2, 1.5, 1)
mu1
mu-1             -0.0631             -0.9986             1.0142                   -2.0692                  -2.2437
sigma^2
beta
CPU time (sec)

1. The EM algorithm works very well when the true beta lies in the range [-0.5, 0.5]. Outside of this region, the accuracy of the beta estimate drops significantly; i.e. when beta equals -0.7, -0.9, 0.7, or 1, NR estimates beta poorly.
2. When the true beta lies outside [-0.5, 0.5] and the initial value of beta is set within [-0.5, 0.5], the poorly estimated beta values converge to a certain limited range of beta.
3. The estimations of ψ = {μ1, μ-1, σ²} and φ = {β} are independent, in the sense that ∂²Q(θ | θ^(old)) / ∂ψ ∂φ = 0. Even in the cases where beta is poorly estimated, the algorithm estimates all the parameters in ψ reasonably well.
4. Depending on the initial values we set, the algorithm sometimes does not terminate; e.g. in the case θ = {2, 1, 0.4, -1}, for certain initial values of β the algorithm fails to stop.

Since the partition function is computationally infeasible, it is impossible to draw the likelihood function with respect to beta directly. Instead we drew ℓ_1'(β), the score function of beta, to clarify why the above phenomena happen. For beta ranging from -2 to 2 with an interval of 0.1, the values of ℓ_1'(β) at the 41 beta values are plotted in the figures below. Four score functions are plotted, for true betas of 0.2, 0.6, 0.7, and 1, respectively. The shape of all four functions resembles an inverted sigmoid: the function value drops sharply within [-0.5, 0.5], while it is essentially flat on [-2, -0.5] ∪ [0.5, 2]. This shape explains all three of the strange behaviors (1, 2, 4) of the estimation process. If we start the algorithm with β^(initial) in [-0.5, 0.5] and the true beta also lies in [-0.5, 0.5], the algorithm always succeeds in estimating the parameters precisely. But if β^(initial) lies in [-0.5, 0.5] while β^(true) lies outside it, the estimated beta will be the point where the score function ℓ_1'(β) first hits zero.

Figure 6. Points of beta where the score function crosses the zero line

As seen in the figures above and reported in the table, when β^(true) = 0.6, 0.7, 1 the estimates are β^(estimated) = 0.53, 0.60, 0.63. In the case where β^(initial) lies in the flat region, the algorithm never terminates regardless of where β^(true) lies, because the Newton-Raphson method diverges on [-2, -0.5] ∪ [0.5, 2] (since ℓ_1''(β) ≈ 0 there). From the data point of view, when the hidden states (Z) become sparse binary, in other words when beta goes beyond a certain threshold, most of the simulated data take a single value, either -1 or +1. A simulated dataset of this kind therefore carries little information about the true value of beta. With such a dataset, if we use a one-parameter Ising model as in our case, the likelihood function is nearly flat around the true beta value. For example, when the true beta is set to 0.6, 0.7, 0.9, 1.5, or 2, we are highly likely to obtain sparse binary hidden states in every case, which makes it extremely hard to tell these cases apart from a finite amount of data (Figure 6).

In recent studies, many researchers apply the HMRF without attending to these theoretical aspects. In most papers on statistical inference in HMRFs, researchers choose some "good" beta (a β^(true) in the informative range), which produces a reasonable mixture of -1 and +1 in the dataset, and perform numerical studies. But real data are often not as well behaved. For instance, in genomic data, when the differentially expressed genes (DEG) are sparse, we have trouble recovering the network information of the DEG. We plan to focus on this sparse binary hidden state problem in future research.

Parameter estimation of the Hidden Markov Spatio-Temporal Random Field
So far we have focused on spatial data at a fixed time t and have assumed the signal Z_t to be observable when fitting Markov random fields to the data. We now consider the more general setting of Hidden Markov Random Fields (HMRFs) for spatial-temporal data Y_t, 1 ≤ t ≤ T, related to the signal Z_t via conditional densities f_ψ(·|·) such that, given Z = {Z_tσ : 1 ≤ t ≤ T, σ ∈ Ω}, the Y_tσ are conditionally independent with density f_ψ(Y_tσ | Z_tσ). A first-order Markovian transition over time is assumed for Z_t, so that the joint density h_φ(·) of Z has the following form:

h_φ(Z) = (1 / C(φ)) exp( Σ_{t=1}^{T} Σ_{σ∈Ω} Z_tσ ( β Σ_{σ'∈N(σ)} Z_tσ' + α Z_{t-1,σ} ) ).   (7)

Equation (7) augments the Ising model of Section 1 by adding the autoregressive term α Z_{t-1,σ} at each voxel σ and time t to the logarithmic link function of the binary time series. Due to the Markovian transition over time, the joint density can be expressed as a product of conditional densities h_φ(· | Z_{t-1}):

h_φ(Z_1, Z_2, ..., Z_T) = h_φ(Z_1) h_φ(Z_2 | Z_1) h_φ(Z_3 | Z_1, Z_2) ... h_φ(Z_T | Z_1, ..., Z_{T-1}) = Π_{t=1}^{T} h_φ(Z_t | Z_{t-1}).
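The exponent of (7) involves two statistics that reappear throughout the estimation below: the spatial interaction statistic and the first-order autoregressive statistic, with no autoregressive term at t = 1. The following minimal sketch evaluates them and the unnormalized log-density β U1(Z) + α U2(Z); the array layout (T, N, N) and the free-boundary convention are assumptions for illustration.

```python
import numpy as np

def spatial_statistic(z_t):
    """U1 contribution at time t: sum_sigma z_t,sigma * sum over first-order
    neighbours z_t,sigma' (double-counting edges, as formula (7) is written)."""
    s = np.zeros_like(z_t, dtype=float)
    s[1:, :] += z_t[:-1, :]; s[:-1, :] += z_t[1:, :]
    s[:, 1:] += z_t[:, :-1]; s[:, :-1] += z_t[:, 1:]
    return np.sum(z_t * s)

def log_unnormalised_density(z, beta, alpha):
    """log h(Z) + log C(phi) for Z of shape (T, N, N): beta*U1(Z) + alpha*U2(Z),
    where U2(Z) = sum_{t>=2} sum_sigma z_t,sigma * z_(t-1),sigma."""
    u1 = sum(spatial_statistic(z[t]) for t in range(z.shape[0]))
    u2 = float(np.sum(z[1:] * z[:-1]))
    return beta * u1 + alpha * u2
```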
Using this relationship, the likelihood function of the observations over time, Y_t = {Y_tσ, σ ∈ Ω}, has the form

f_θ(Y_1, Y_2, ..., Y_T) = Σ_{Z_1,...,Z_T} f_θ(Y_1, ..., Y_T, Z_1, ..., Z_T)
                        = Σ_{Z_1,...,Z_T} f_θ(Y_1, ..., Y_T | Z_1, ..., Z_T) h_φ(Z_1, Z_2, ..., Z_T)
                        = Σ_{Z_1,...,Z_T} [ Π_{t=1}^{T} Π_{σ∈Ω} f_ψ(Y_tσ | Z_tσ) ] Π_{t=1}^{T} h_φ(Z_t | Z_{t-1})
                        = Σ_{Z_1,...,Z_T} Π_{t=1}^{T} [ ( Π_{σ∈Ω} f_ψ(Y_tσ | Z_tσ) ) h_φ(Z_t | Z_{t-1}) ],

in which h_φ(Z_t | Z_{t-1}) = h_φ(Z_1) for the case t = 1. Since f_ψ depends only on the sub-vector ψ of θ and h_φ depends only on the sub-vector φ of θ, we differentiate the notation of the two densities. We again assume a Gaussian Hidden Markov Spatial-Temporal random field, where the conditional density f_ψ(· | Z_tσ) is the normal distribution N(μ, σ²) with μ = μ1 if Z_tσ = 1 and μ = μ-1 if Z_tσ = -1 (i.e. ψ = (μ1, μ-1, σ²) and φ = (α, β)). Since the estimation procedures are exactly the same as in Section 3, we give a much briefer account of the parameter estimation process for HMRFs with spatial-temporal data. We start by taking the conditional expectation of the complete-data log-likelihood. First, the complete-data log-likelihood is

log f_θ(Y_1, ..., Y_T, Z_1, ..., Z_T)
  = log Π_{t=1}^{T} [ ( Π_{σ∈Ω} f_ψ(Y_tσ | Z_tσ) ) h_φ(Z_t | Z_{t-1}) ]
  = Σ_{t=1}^{T} Σ_{σ∈Ω} log f_ψ(Y_tσ | Z_tσ) + log h_φ(Z_1, Z_2, ..., Z_T)
  = Σ_{t=1}^{T} Σ_{σ∈Ω} [ I(Z_tσ = 1) log f_ψ(Y_tσ | Z_tσ = 1) + I(Z_tσ = -1) log f_ψ(Y_tσ | Z_tσ = -1) ]
    + β Σ_{t=1}^{T} Σ_{σ∈Ω} Z_tσ Σ_{σ'∈N(σ)} Z_tσ' + α Σ_{t=1}^{T} Σ_{σ∈Ω} Z_tσ Z_{t-1,σ} - log C(φ),

where C(φ) = Σ_{Z_1,...,Z_T ∈ {-1,+1}^{T|Ω|}} exp( Σ_{t=1}^{T} Σ_{σ∈Ω} Z_tσ ( β Σ_{σ'∈N(σ)} Z_tσ' + α Z_{t-1,σ} ) ). For simplicity, put

U1(Z) = Σ_{t=1}^{T} Σ_{σ∈Ω} Z_tσ Σ_{σ'∈N(σ)} Z_tσ',
U2(Z) = Σ_{t=1}^{T} Σ_{σ∈Ω} Z_tσ Z_{t-1,σ}.

Then the conditional expectation of the complete-data log-likelihood can be expressed as

Q(θ | θ^(old)) = E_{θ^(old)}[ log f_θ(Y, Z) | Y ] = ℓ_1(φ) + ℓ_2(ψ),   where ψ = (μ1, μ-1, σ²) and φ = (α, β),

with

ℓ_1(φ | θ^(old)) = β E_{θ^(old)}[ U1(Z) | Y ] + α E_{θ^(old)}[ U2(Z) | Y ] - log C(φ),
ℓ_2(ψ | θ^(old)) = Σ_{t=1}^{T} Σ_{σ∈Ω} [ P_{θ^(old)}(Z_tσ = 1 | Y_tσ) log f_ψ(Y_tσ | Z_tσ = 1) + P_{θ^(old)}(Z_tσ = -1 | Y_tσ) log f_ψ(Y_tσ | Z_tσ = -1) ].

Here, too, we can estimate the parameters ψ and φ independently, in the sense that ∂²Q(θ | θ^(old)) / ∂ψ ∂φ = 0.

Estimation of μ1 and μ-1. As in the first estimation step of Section 3.3, we obtain the MLEs of μ1 and μ-1 as the solutions of

Σ_{t=1}^{T} Σ_{σ∈Ω} [ P_{θ^(old)}(Z_tσ = 1 | Y_tσ) ∂ log f_ψ(Y_tσ | Z_tσ = 1)/∂μ1 + P_{θ^(old)}(Z_tσ = -1 | Y_tσ) ∂ log f_ψ(Y_tσ | Z_tσ = -1)/∂μ1 ] = 0,

which gives

μ1 = Σ_{t=1}^{T} Σ_{σ∈Ω} P_{θ^(old)}(Z_tσ = 1 | Y_tσ) Y_tσ / Σ_{t=1}^{T} Σ_{σ∈Ω} P_{θ^(old)}(Z_tσ = 1 | Y_tσ),

and likewise

μ-1 = Σ_{t=1}^{T} Σ_{σ∈Ω} P_{θ^(old)}(Z_tσ = -1 | Y_tσ) Y_tσ / Σ_{t=1}^{T} Σ_{σ∈Ω} P_{θ^(old)}(Z_tσ = -1 | Y_tσ).

Estimation of σ². The estimate of σ² solves ∂Q(θ | θ^(old))/∂σ = 0, which is equivalent to

Σ_{t=1}^{T} Σ_{σ∈Ω} [ P_{θ^(old)}(Z_tσ = 1 | Y_tσ) ∂ log f_ψ(Y_tσ | Z_tσ = 1)/∂σ + P_{θ^(old)}(Z_tσ = -1 | Y_tσ) ∂ log f_ψ(Y_tσ | Z_tσ = -1)/∂σ ] = 0.

After some tedious calculation, the solution can be written neatly in terms of σ² as

σ² = (1 / (T|Ω|)) Σ_{t=1}^{T} Σ_{σ∈Ω} { P_{θ^(old)}(Z_tσ = 1 | Y_tσ) (Y_tσ - μ1^(new))² + P_{θ^(old)}(Z_tσ = -1 | Y_tσ) (Y_tσ - μ-1^(new))² }.

Estimation of the hyper-parameters β and α. We also apply the Newton-Raphson method to estimate the two parameters β and α. The NR step for the two parameters is

( β̂^(t), α̂^(t) ) = ( β̂^(t-1), α̂^(t-1) ) - H(β̂^(t-1), α̂^(t-1))^{-1} ( ∂ℓ_1(φ | θ^(old))/∂β, ∂ℓ_1(φ | θ^(old))/∂α ),

where H is the Hessian of ℓ_1 with respect to (β, α), and

∂ℓ_1(φ | θ^(old))/∂β = E_{θ^(old)}[ U1(Z) | Y ] - (∂C(φ^(t-1))/∂β) / C(φ^(t-1)),
∂ℓ_1(φ | θ^(old))/∂α = E_{θ^(old)}[ U2(Z) | Y ] - (∂C(φ^(t-1))/∂α) / C(φ^(t-1)),

with

(∂C(φ^(t-1))/∂β) / C(φ^(t-1)) = Σ_Z U1(Z) exp( β^(t-1) U1(Z) + α^(t-1) U2(Z) ) / Σ_Z exp( β^(t-1) U1(Z) + α^(t-1) U2(Z) ) = E_{φ^(t-1)}[ U1(Z) ],
(∂C(φ^(t-1))/∂α) / C(φ^(t-1)) = Σ_Z U2(Z) exp( β^(t-1) U1(Z) + α^(t-1) U2(Z) ) / Σ_Z exp( β^(t-1) U1(Z) + α^(t-1) U2(Z) ) = E_{φ^(t-1)}[ U2(Z) ].

The two expectations above can be approximated through Monte Carlo samples from the Gibbs distribution (7), generated at the parameter setting φ^(t-1) = (α^(t-1), β^(t-1)). Here we use the block Gibbs sampler scheme to sample efficiently from the Gibbs distribution.
Computation of the Hessian. Each element of the Hessian can be calculated in a similar fashion:

∂²ℓ_1(φ | θ^(old))/∂β² = ∂²/∂β² [ - log C(φ^(t-1)) ]
  = - [ (∂²C(φ^(t-1))/∂β²) / C(φ^(t-1)) - ( (∂C(φ^(t-1))/∂β) / C(φ^(t-1)) )² ]
  = - ( E_{φ^(t-1)}[ U1(Z)² ] - ( E_{φ^(t-1)}[ U1(Z) ] )² )
  = - var_{φ^(t-1)}( U1(Z) ),

where the term (∂²C(φ^(t-1))/∂β²) / C(φ^(t-1)) is calculated as

(∂²C(φ^(t-1))/∂β²) / C(φ^(t-1)) = Σ_Z U1(Z)² exp( β^(t-1) U1(Z) + α^(t-1) U2(Z) ) / Σ_Z exp( β^(t-1) U1(Z) + α^(t-1) U2(Z) ) = E_{φ^(t-1)}[ U1(Z)² ].

Similarly, ∂²ℓ_1(φ | θ^(old))/∂α² can be computed in exactly the same way:

∂²ℓ_1(φ | θ^(old))/∂α² = - var_{φ^(t-1)}( U2(Z) ).

Lastly,

∂²ℓ_1(φ | θ^(old))/∂α∂β = ∂²/∂α∂β [ - log C(φ^(t-1)) ]
  = - [ (∂²C(φ^(t-1))/∂α∂β) / C(φ^(t-1)) - (∂C(φ^(t-1))/∂α)(∂C(φ^(t-1))/∂β) / C(φ^(t-1))² ]
  = - ( E_{φ^(t-1)}[ U1(Z) U2(Z) ] - E_{φ^(t-1)}[ U1(Z) ] E_{φ^(t-1)}[ U2(Z) ] )
  = - cov_{φ^(t-1)}( U1(Z), U2(Z) ).

The Hessian of ℓ_1(φ | θ^(old)) therefore has the form

H(β̂^(t-1), α̂^(t-1)) = - [ var_{φ^(t-1)}(U1(Z))          cov_{φ^(t-1)}(U1(Z), U2(Z)) ;
                            cov_{φ^(t-1)}(U1(Z), U2(Z))   var_{φ^(t-1)}(U2(Z)) ].

The iteration is terminated when the L2 norm of the difference between the newly updated parameters and the previous parameters is less than or equal to a tolerance ε_tol; in our estimation we set ε_tol = 10^-3:

sqrt( (β̂^(t) - β̂^(t-1))² + (α̂^(t) - α̂^(t-1))² ) ≤ ε_tol.
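A sketch of one Newton-Raphson step for (β, α) as derived above, assuming Monte Carlo estimates of the means, variances, and covariance of (U1(Z), U2(Z)) under the Gibbs distribution at the current parameters (from the block Gibbs sampler), along with the corresponding posterior expectations; the function name prior_moments is a placeholder.

```python
import numpy as np

def newton_step_beta_alpha(beta, alpha, post_u1, post_u2, prior_moments):
    """One Newton-Raphson step for (beta, alpha).

    post_u1, post_u2 : posterior expectations E_{theta_old}[U1(Z)|Y], E_{theta_old}[U2(Z)|Y].
    prior_moments    : function (beta, alpha) -> (mean_u1, mean_u2, var_u1, var_u2, cov_u12),
                       estimated from block Gibbs samples of the Gibbs distribution (7).
    """
    m1, m2, v1, v2, c12 = prior_moments(beta, alpha)
    grad = np.array([post_u1 - m1, post_u2 - m2])       # gradient of l1(phi)
    hess = -np.array([[v1, c12], [c12, v2]])             # Hessian = -cov matrix of (U1, U2)
    step = np.linalg.solve(hess, grad)                    # H^{-1} grad
    return beta - step[0], alpha - step[1]
```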
Block Gibbs Sampler for posterior distribution (Spatio-Temporal)
Let us denote the indices of the four nodes in the ith block at time t by Z_{B_it} = {Z_{i1t}, Z_{i2t}, Z_{i3t}, Z_{i4t}}. Using the conditional independence property of the undirected graphical model, the full conditional distribution of Z_{B_it} under the posterior distribution given the observations Y_{B_it} can be written as

P_{θ^(old)}(Z_{B_it} | z̃_{\B_it}, Y_{B_it}) = P_{θ^(old)}(Z_{B_it} | z̃_{N(B_it)}, Y_{B_it})
  = q_{θ^(old)}(Z_{B_it}, z̃_{N(B_it)}, Y_{B_it}) / Σ_{Z_{B_it}∈{-1,+1}^4} q_{θ^(old)}(Z_{B_it}, z̃_{N(B_it)}, Y_{B_it}),

where N(B_it) is the set of indices of nodes surrounding the 2 x 2 regular block. The term in the numerator can be expressed as

q_{θ^(old)}(Z_{B_it}, z̃_{N(B_it)}, Y_{B_it})
  = exp{ - Σ_k (Y_{ikt} - μ^(old)_{Z_{ikt}})² / (2 σ²^(old))
         + β^(old) ( Σ_{(i,k,t)~(i,l,t)} Z_{ikt} Z_{ilt} + Σ_k Z_{ikt} Σ_{j∈L_ikt} z̃_j )
         + α^(old) Σ_k Z_{ikt} ( z̃_{ik,t-1} + z̃_{ik,t+1} ) },

where the symbol (i,k,t) ~ (i,l,t) indicates that nodes (i,k,t) and (i,l,t) are neighbors within the block B_it, and L_ikt = N(B_it) ∩ N(Z_{ikt}) denotes the intersection of the set of indices of nodes neighboring the block B_it and the set of indices of nodes neighboring Z_{ikt}. Note that the only difference between the full conditionals of the block Gibbs sampler for spatial data and for spatial-temporal data is the presence or absence of the first-order autoregressive term. Since the block Gibbs sampler for the spatial-temporal Gibbs distribution follows in exactly the same way, we skip it and move on to the results and discussion of our simulation study of the spatial-temporal HMRF.
Estimation result for simulation study of Spatial-Temporal HMRF
We performed two simulation studies, for lattice sizes 24 x 24 x 90 and 48 x 48 x 90 respectively. Table 3 gives the MLE of θ = {μ1, μ-1, σ², β, α} for a single simulated dataset in each case. Here we also generated each dataset using the 2 x 2 block Gibbs sampling method. Because of the heavy computational load of our algorithm, it was coded in C rather than MATLAB. The table also gives the CPU time (min) on a laptop PC (Intel Core i7-5500 CPU, 2.4 GHz). As the table shows, although the algorithm estimates all the parameters accurately, the time it takes to complete the estimation is quite long. Furthermore, though not displayed in the table, if either β or α goes beyond a certain threshold, the model again produces sparse binary state data and we have trouble estimating both parameters. To improve the speed of parameter estimation for the spatial-temporal HMRF, Dr. Tze L. Lai and Dr. Lim (2015) suggested a new solution based on a block likelihood function, which partitions the whole dataset into smaller pieces and computes the estimates in parallel. This method significantly reduces the estimation time and gives us new insight into our next research directions.

Table 3. Results of parameter estimations in Spatial-Temporal HMRF

Lattice          True parameters (mu1, mu-1, sigma^2, beta, alpha)    mu1        mu-1    sigma^2    beta    alpha    CPU time (min)
24 x 24 x 40     (0.3, 1.5, 0.49, 0.3, 0.1)
48 x 48 x 90     (0, 2, 1, 0.1, 0.1)                                  -0.00982

References

1. Siddhartha Chib, Edward Greenberg (1995): Understanding the Metropolis-Hastings Algorithm, American Statistical Association.
2. Lei W., Jun Liu, et al. (1999): MRF parameter estimation by MCMC method, Pattern Recognition.
3. Martin J. Wainwright, Michael Jordan (2008): Graphical Models, Exponential Families, and Variational Inference, Foundations and Trends in Machine Learning.
4. Tze Leung Lai, Johan Lim (2015): Asymptotically Efficient Parameter Estimation in Hidden Markov Spatio-Temporal Random Fields, Statistica Sinica.
5. Johan Lim, Kyungsuk Pyun, Chee Sun Won (2007): Image Segmentation Using Hidden Markov Gauss Mixture Models, IEEE.
6. Bartolucci, Besag (2002): A recursive algorithm for Markov Random Fields, Biometrika.
7. Ruslan Salakhutdinov: Learning in Markov Random Fields using Tempered Transitions.
8. Zhou, Leahy, Qi (1997): Approximate Maximum Likelihood Hyperparameter Estimation for Gibbs Priors, IEEE.