In the mathematical theory of stochastic processes, variable-order Markov (VOM) models are an important class of models that extend the well-known Markov chain model. In a Markov chain, each random variable in the sequence depends on a fixed number of preceding random variables, whereas in a VOM model the number of conditioning random variables can vary with the specific observed realization. This realization sequence is often called a context; for this reason, VOM models are also called context trees.
The flexibility of the VOM model lies in this varying number of conditioning random variables, which gives it real advantages in many applications, such as statistical analysis, classification, and prediction.
For example, consider a sequence of random variables, each taking a value from the ternary alphabet {a, b, c}. Specifically, consider a string consisting of the substring aaabc repeated infinitely: aaabcaaabcaaabc…aaabc. The VOM model with a maximum order of 2 can approximate the above string using only the following five conditional probability components: Pr(a | aa) = 0.5, Pr(b | aa) = 0.5, Pr(c | b) = 1.0, Pr(a | c) = 1.0, Pr(a | ca) = 1.0.
In this example, Pr(c | ab) = Pr(c | b) = 1.0; therefore, the shorter context b is sufficient to determine the next character.
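As an illustration, these five components can be stored as a plain mapping from context to next-symbol distribution, with prediction falling back from the longest matching suffix of the history to shorter ones. This is a minimal Python sketch; the names vom2 and next_distribution are hypothetical, not any established library's API.

```python
# Minimal sketch: the five conditional probability components of the
# maximal-order-2 VOM model above, keyed by context.
vom2 = {
    "aa": {"a": 0.5, "b": 0.5},
    "b":  {"c": 1.0},
    "c":  {"a": 1.0},
    "ca": {"a": 1.0},
}

def next_distribution(history, model, max_order=2):
    """Return the distribution of the longest context in the model that
    matches a suffix of the history (hypothetical helper)."""
    for length in range(min(max_order, len(history)), 0, -1):
        context = history[-length:]
        if context in model:
            return model[context]
    return None  # no matching context

print(next_distribution("aaabcaa", vom2))  # {'a': 0.5, 'b': 0.5}
print(next_distribution("aaab", vom2))     # {'c': 1.0}, via the shorter context 'b'
```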
Similarly, the VOM model with a maximum order of 3 can generate this string exactly and requires only five conditional probability components, all of which have a value of 1.0. To build a Markov chain of order 1 for this string, 9 conditional probability components must be estimated: Pr(a | a), Pr(a | b), Pr(a | c), Pr(b | a), Pr(b | b), Pr(b | c), Pr(c | a), Pr(c | b), Pr(c | c). To predict the next character with a Markov chain of order 2, 27 conditional probability components need to be estimated; with a Markov chain of order 3, 81 conditional probability components have to be estimated. In practice, there is rarely enough data to accurately estimate a number of conditional probability components that grows exponentially with the order of the Markov chain.
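The exponential growth is easy to make concrete: an order-D Markov chain over an alphabet of size |A| must estimate |A|^D × |A| conditional probability components. A quick sketch of the arithmetic:

```python
# Number of conditional probability components in a fixed-order chain:
# |A|**D contexts, each with |A| next-symbol probabilities.
A = 3
for D in (1, 2, 3):
    print(f"order {D}: {A**D * A} components")
# order 1: 9, order 2: 27, order 3: 81 -- versus five for the VOM model above.
```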
Variable-order Markov models assume that, in realistic settings, certain realizations of states (represented by contexts) render some past states independent of future states; accordingly, the number of model parameters can be reduced significantly.
By definition, let A be a state space (a finite alphabet) of size |A|. Consider a sequence x1^n = x1x2…xn with the Markov property, where xi ∈ A is the state (symbol) at position i, and the concatenation of states xi and xi+1 is written xixi+1. Given a training set of observed states x1^n, the construction algorithm of the VOM model learns a model that assigns a probability to each state in the sequence given its past. Specifically, the learner generates a conditional probability distribution P(xi | s) for each symbol xi ∈ A and context s ∈ A*, where the * symbol denotes a state sequence of arbitrary length, including the empty context.
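A hedged sketch of the counting step such a learner might start from: gather, for every context s with |s| ≤ D, how often each symbol follows it in the training sequence, then normalize. The function name learn_vom_counts is illustrative only; real construction algorithms (e.g., probabilistic suffix tree learners) add pruning and smoothing on top of these counts.

```python
from collections import defaultdict

def learn_vom_counts(sequence, max_order):
    """Count how often each context s with len(s) <= max_order precedes
    each symbol, then normalize the counts into estimates of P(x_i | s).
    Illustrative sketch; real VOM learners add pruning and smoothing."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, symbol in enumerate(sequence):
        for length in range(min(max_order, i) + 1):
            context = sequence[i - length:i]  # "" is the empty context
            counts[context][symbol] += 1
    model = {}
    for ctx, successors in counts.items():
        total = sum(successors.values())
        model[ctx] = {sym: n / total for sym, n in successors.items()}
    return model

model = learn_vom_counts("aaabc" * 100, max_order=2)
print(model["aa"])  # {'a': 0.5, 'b': 0.5}
```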
The VOM model aims to estimate conditional distributions P(xi | s) in which the context length |s| ≤ D varies according to the available statistics. In contrast, the traditional Markov model assumes a fixed context length, |s| = D, for these conditional distributions, and can therefore be regarded as a special case of the VOM model. For a given training sequence, VOM models have been found to achieve better model parameterization than fixed-order Markov models, yielding a better bias-variance tradeoff in the learned model.
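The variable context length arises from discarding contexts that add nothing over their shorter suffix, exactly as with Pr(c | ab) = Pr(c | b) above. Continuing the learn_vom_counts sketch, a simplified pruning pass might look like the following; practical learners compare distributions with statistical tests or KL-divergence thresholds rather than exact equality.

```python
def prune(model, tol=1e-9):
    """Drop any context whose predictive distribution matches that of its
    shorter suffix; this is the source of the reduced parameterization.
    Simplified sketch using (near-)exact equality as the criterion."""
    kept = {}
    for ctx, dist in model.items():
        parent = ctx[1:]  # drop the oldest symbol to get the shorter suffix
        if ctx and parent in model:
            pdist = model[parent]
            if set(dist) == set(pdist) and all(
                    abs(dist[s] - pdist[s]) < tol for s in dist):
                continue  # the shorter context predicts equally well
        kept[ctx] = dist
    return kept

pruned = prune(learn_vom_counts("aaabc" * 100, max_order=2))
print("ab" in pruned)  # False: Pr(c | ab) = Pr(c | b) = 1.0, as noted above
print("aa" in pruned)  # True: the context 'a' alone predicts differently
```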
Various efficient algorithms have been developed to estimate the parameters of the VOM model, and the model has been successfully applied in fields such as machine learning, information theory, and bioinformatics.
Specific applications include coding and data compression, document compression, classification and identification of DNA and protein sequences, statistical process control, spam filtering, haplotyping, speech recognition, and sequence analysis in the social sciences. In these applications, the variable-order Markov model demonstrates distinct advantages and practical value.
The VOM model is thus not only a theoretical contribution: its practical applications offer solutions to a range of real-world challenges, providing a principled way to predict future behavior and trends in complex, ever-changing data environments.