The Baum-Welch algorithm determines the (locally) optimal parameters for a Hidden Markov Model by essentially using three equations.

One for the initial probabilities:

$$

\begin{align}

\pi_i &= \frac{E\left(\text{Number of times a sequence started with state}\, s_i\right)}{E\left(\text{Number of times a sequence started with any state}\right)}

\end{align}

$$

Another for the transition probabilities:

$$

\begin{align}

a_{ij} &= \frac{E\left(\text{Number of times the state changed from}\, s_i \, \text{to}\,s_j\right)}{E\left(\text{Number of times the state changed from}\, s_i \, \text{to any state}\right)}

\end{align}

$$

And the last one for the emission probabilities:

$$

\begin{align}

b_{ik} &= \frac{E\left(\text{Number of times the state was}\, s_i \, \text{and the observation was}\,v_k\right)}{E\left(\text{Number of times the state was}\, s_i\right)}

\end{align}

$$

If one had a fully labeled training corpus representing all possible outcomes, this would be exactly the optimal solution: Count each occurrence, normalize and you’re good. If, however, no such labeled training corpus is available – i.e. only observations are given, no according state sequences – the expected values \(E(c)\) of these counts would have to be estimated. This can be done (and is done) using the forward and backward probabilities \(\alpha_t(i)\) and \(\beta_t(i)\), as described below.

## Hidden Markov Models

Given a Hidden Markov Model \(\lambda\) as \(\lambda = \left\{\mathcal{S}; \mathcal{V}; \underline{A}; \underline{B}; \vec{\pi}\right\}\) with

$$ \begin{align} \mathcal{S} &= \left\{s_1; s_2; \cdots ; s_N\right\} \end{align} $$ |
The alphabet of all \(N\) possible states, i.e. values of the random variable \(X_t\) |

$$ \begin{align} \mathcal{V} &= \left\{v_1; v_2; \cdots ; v_M\right\} \end{align} $$ |
The alphabet of all \(M\) possible observations, i.e. emissions of the random variable \(Y_t\) |

$$ \begin{align} \underline{A} &\in \mathbb{R}^{N \times N} \end{align} $$ |
The state transition matrix with \(a_{ij}\) being the probability of moving from state \(s_i\) to \(s_j\) |

$$ \begin{align} \underline{B} &\in \mathbb{R}^{N \times M} \end{align} $$ |
The observation matrix with \(b_{ij} = b_i(v_j)\) being the probability of emitting the observation \(v_j\) when in state \(s_i\) |

$$ \begin{align} \vec{\pi} &\in \mathbb{R}^{n} \end{align} $$ |
The initial distribution with \(pi_i = P(X_1 = s_i)\), i.e. the probability that the 1st state is \(s_i\) |

then given the graph in the image above and assuming that the probability of staying in a state \(s_i\) at time \(t\) is simply \(P(X_t = s_i | X_{t-1} = s_i) = 1-\left(\sum_{j=1}^n{a_{ij}} \quad \forall j \neq i\right)\) (otherwise the sum of probabilities of leaving \(s_3\) would never be 1, as there is no edge in the graph leaving \(s_3\) – this is a problem with the image though), i.e.

$$

\begin{align}

a_{11} &= 1-(a_{12}) \\

a_{22} &= 1-(a_{21} + a_{23}) \\

a_{33} &= 1

\end{align}

$$

then the matrices \(\underline{A}\) and \(\underline{B}\) would look as follows;

$$

\begin{align}

\underline{A} &= \begin{bmatrix}

a_{11} & a_{12} & 0 \\

a_{21} & a_{22} & a_{23} \\

0 & 0 & a_{33}

\end{bmatrix} \\

\underline{B} &= \begin{bmatrix}

b_{11} & b_{12} & b_{13} & b_{14} \\

b_{21} & b_{22} & b_{23} & b_{24} \\

b_{31} & b_{32} & b_{33} & b_{34} %\\

%b_{41} & b_{42} & b_{43} & b_{44}

\end{bmatrix}

\end{align}

$$

The \(\vec{\pi}\) vector isn’t given so one might assume an equal distribution, i.e. \(\pi_i = \frac{1}{n} \quad \forall s_i \in \mathcal{S}\), such that

$$

\begin{align}

\vec{\pi} &= \begin{bmatrix}

1/3 \\

1/3 \\

1/3

\end{bmatrix}

\end{align}

$$

so that there is (or would be) an equal probability to start in any of the possible states.

## Initial, transition and emission probabilities

Given the model \(\lambda\), here are some probabilities to get us started. But first, some notations:

$$ P(X_t = s_i) $$ |
The probability of the state at time \(t\) being \(s_i\) |

$$ P(X_t = s_j | X_{t-1} = s_i) $$ |
The conditional probability of the state at time \(t\) being \(s_j\) given that the state at time \(t-1\) was \(s_i\) |

$$ P(X_t = s_i, Y_{t} = v_j) $$ |
The probability of the state at time \(t\) being \(s_i\) while seing an observation that is \(v_j\) |

The probability that the model starts (i.e. \(t=1\)) in a given state \(s_i\) is simply:

$$

\begin{align}

P(X_1 = s_i) &= \pi_i

\end{align}

$$

Also, the (conditional) probability that the model emits an observation \(v_j\) given that it is in a state \(s_i\) at time \(t\) is:

$$

\begin{align}

P(Y_t = v_j | X_t = s_i) &= b_{ij}

\end{align}

$$

The probability that the model starts in a given state \(s_i\) while emitting the observation $v_j$ then is:

$$

\begin{align}

P(X_1 = s_i, Y_1 = v_j) &= P(X_1 = s_i) \cdot P(Y_1 = v_j | X_1 = s_i) \\

&= \pi_i \cdot b_{ij}

\end{align}

$$

This is because the emission of an observation and the selection of a state are conditionally independent.

By definition, the (conditional) probability of moving from a state \(s_i\) at time \(t\) to \(s_j\) at time \(t+1\) is the probability of being at state \(s_j\) at time \(t\) given the state \(s_i\) at time \(t-1\), which simply is:

$$

\begin{align}

P(X_{t} = s_j | X_{t-1} = s_i) &= P(X_{t+1} = s_j | X_{t} = s_i) = a_{ij}

\end{align}

$$

Consequently, the conditional probability of moving from a state \(s_i\) to \(s_j\) while emitting the observation \(v_k\) is then:

$$

\begin{align}

P(X_t = s_j, Y_t = v_k| X_{t-1} = s_i) &= P(X_{t} = s_j | X_{t-1} = s_i) \cdot P(Y_t = v_k | X_t = s_j) \\

&= a_{ij} \cdot b_{jk}

\end{align}

$$

Finally, the conditional probability of starting in state \(s_i\) while emitting the observation \(v_k\), then transitioning to state \(s_j\) while emitting the observation \(v_l\) is

$$

\begin{align}

P(X_2 = s_j, Y_2 = v_l | X_1 = s_i, Y_1 = v_k) &= P(X_1 = s_i, Y_1 = v_k) \cdot P(X_2 = s_j, Y_2 = v_l | X_1 = s_i) \\

&= \pi_i \cdot b_{ik} \cdot a_{ij} \cdot b_{jl}

\end{align}

$$

To make things a bit easier later on, we define \(b_j(Y_t)\) as a function of the emission matrix \(\underline{B}\) such that

$$

\begin{align}

b_i(Y_t = v_j) &= P(Y_t = v_j, X_t = s_i) = b_{ij} \\

\Leftrightarrow Y_t &= v_j \,\implies \, b_i(Y_t) = P(Y_t, X_t = s_i) = b_{ij}

\end{align}

$$

## The forward probabilities

The forward probability \(\alpha_t(i)\) describes the probability of a (partial) observation sequence \(Y_1,Y_2, \cdots Y_t\) having been emitted while ending in state \(s_i\) at time \(t\), i.e.

$$

\begin{align}

\alpha_t(i) &= P(Y_1, Y_2, \cdots, Y_t, X_t = s_i)

\end{align}

$$

It can be seen intuitively that \(\alpha_1(i)\) is nothing but the probability of starting (i.e. at time \(t=1\)) with a state \(s_i\) while emitting a specific $Y \in \mathcal{V}$.

$$

\begin{align}

\alpha_1(i) &= P(Y_1, X_1 = s_i) = \pi_i \cdot b_{i}(Y_1), \qquad 1 \leq i \leq N

\end{align}

$$

Let’s assume the graph depicted in the image above implicitly contains the probabilities to stay in each node, i.e. $P(X_t = s_i | X_{t-1} = s_i) > 0 \, \forall s_i \in \mathcal{S}$ (which is missing in the image as there are no such edges).

Let’s also assume that we have seen the observation sequence \(Y_{1\cdots2} = \left\{v_3, v_1\right\}\) at times \(t=1\cdots 2\) (i.e. \(T=2\)) and that we are interested in the probability of being in state \(s_1\) after that observation.

Now there are multiple ways to reach state \(s_1\) at time \(t=2\) (note that there is no edge between \(s_3\) and \(s_1\)):

\(X_1 \to X_2\) | \(P(X_2|X_1)\) | \(P(X_2, Y_2 | X_1, Y_1)\) |
---|---|---|

\(s_1 \to s_1\) | \(a_{11}\) | \(\pi_{1} \cdot b_1(v_3) \cdot a_{11} \cdot b_1(v_1)\) |

\(s_2 \to s_1\) | \(a_{21}\) | \(\pi_{2} \cdot b_2(v_3) \cdot a_{21} \cdot b_1(v_1)\) |

\(s_3 \to s_1\) | \(a_{31}\) = 0 | \(\pi_{3} \cdot b_3(v_3) \cdot a_{31} \cdot b_1(v_1) = 0\) |

Now the total probability of reaching \(s_1\) while seing the emission \(v_1\) is the sum over all these path probabilities, which is \(\alpha_{t = 2}(i = 1)\):

$$

\begin{align}

\alpha_{2}(1) &= \left( \pi_{1} \cdot b_1(v_3) \cdot a_{11} \cdot b_1(v_1) \right) \\

&+ \left( \pi_{2} \cdot b_2(v_3) \cdot a_{21} \cdot b_1(v_1) \right) \\

&+ \left( \pi_{3} \cdot b_3(v_3) \cdot a_{31} \cdot b_1(v_1) \right) \\

&= \left( \underbrace{\pi_{1} \cdot b_1(v_3)}_{\alpha_1(1)} \cdot a_{11} + \underbrace{\pi_{2} \cdot b_2(v_3)}_{\alpha_1(2)} \cdot a_{21} + \underbrace{\pi_{3} \cdot b_3(v_3)}_{\alpha_1(3)} \cdot a_{31} \right) \cdot b_1(v_1) \\

&= \left( \sum_{i=1}^N \alpha_1(i) \cdot a_{i1} \right) \cdot b_1(v_1)

\end{align}

$$

which brings us to the recursive definition of the alpha function:

$$

\begin{align}

\alpha_1(i) &= \pi_i \cdot b_{i}(Y_1), &\qquad 1 \leq i \leq N \\

\alpha_t(j) &= \left( \sum_{i=1}^N \alpha_{t-1}(i) \cdot a_{ij} \right) \cdot b_j(Y_t), &\qquad 1 \lt t \leq T, 1 \leq j \leq N

\end{align}

$$

Note that for each time step \(t\) we are only interested in the probability of the event in the previous time step \(t-1\) (and not \(t-2\), etc.), which is due to the Markov assumption.

function [p] = alpha(lambda, Y, t, X) T = numel(Y); assert(1 <= t && t <= T, '1 <= t <= T'); % termination criterion if t == 1 p = lambda.pi(X) * lambda.B(X, Y(1)); return; end % recursive calculation p = 0; for j = 1:lambda.N p = p + alpha(lambda, Y, t-1, j) ... * lambda.A(j, X) * lambda.B(X, Y(t)); end end

## The backward probabilities

Now that we know how likely it is to observe an output sequence \(Y_{1 \cdots t} \) and, after that, be in a state \(s_i\) we are interested in obtaining the probability of observing a future sequence \(Y_{t+1 \cdots T}\) given that we actually are in state \(s_i\) at time \(t\), which is where the $\beta_t(i)$ function comes in. I found that function to be easier to understand if one doesn’t think of “future observation” as much as a prediction step, but more as a probability that a current model is able to emit a sequence (starting at \(t+1\)) when it comes to training. I’ll try to clarify with an example given the above graph.

Given three observations (i.e. emissions) \(Y=\left\{v_3, v_1, v_4\right\}\) (i.e. \(T=3\)) and that we are in state $X_{t=3} = s_1$ at time \(t=T\), what is the probability of emitting *any* future observation \(Y_{t=4}\).

$$

\begin{align}

\beta_3(1) &= P(Y_{4} = any|X_3 = s_1) \\

&= P(v_1|s_1) + P(v_2|s_1) + P(v_3|s_1) + P(v_4|s_1) \\

&= \sum_{j=1}^M{P(y_j|s_1)} = 1

\end{align}

$$

This works out because for every state there is exactly one emission by definition. Because of this, \(P(Y_{T+1} = any)\) is independent of the state and thus

$$

\begin{align}

\beta_T(i) &= P(Y_{T+1} = any|X_T = s_i) \\

&= P(Y_{T+1} = any) \\

&= 1

\end{align}

$$

Now stepping back to $t = T-1 = 2$; assuming that we are in state $X_2 = s_1$, the probability of emitting \(Y_{3} = v_4\) at \(t=3\) is

$$

\begin{align}

\beta_2(1) &= P(Y_{4} = any, Y_{3} = v_4 | X_2 = s_1) \\

&= P(Y_{4} = any, Y_{3} = v_4, X_3 = s_1 | X_2 = s_1) \\

&+ P(Y_{4} = any, Y_{3} = v_4, X_3 = s_2 | X_2 = s_1) \\

&+ P(Y_{4} = any, Y_{3} = v_4, X_3 = s_3 | X_2 = s_1)

\end{align}

$$

which leads to

$$

\begin{align}

\beta_2(1) &= P(Y_4=any|X_3=s_1,X_2=s_1) \cdot P(Y_3=v_4,X_3=s_1|X_2=s_1) \\

&+ P(Y_4=any|X_3=s_2,X_2=s_1) \cdot P(Y_3=v_4,X_3=s_2|X_2=s_1) \\

&+ P(Y_4=any|X_3=s_3,X_2=s_1) \cdot P(Y_3=v_4,X_3=s_3|X_2=s_1)

\end{align}

$$

by applying the Markov principle and noting that \(T=4\),

$$

\begin{align}

P(Y_4|X_3=s_i,X_2=s_j) &= P(Y_4|X_3=s_i) \\

&= \beta_3(i) = 1

\end{align}

$$

which yields, by taking into account all the possible following states,

$$

\begin{align}

\beta_2(1) &= \beta_3(1) \cdot P(Y_3=v_4,X_3=s_1|X_2=s_1) \\

&+ \beta_3(2) \cdot P(Y_3=v_4,X_3=s_2|X_2=s_1) \\

&+ \beta_3(3) \cdot P(Y_3=v_4,X_3=s_3|X_2=s_1) \\

&= \sum_{j=1}^N \left( P(Y_3=v_4,X_3=s_j|X_2=s_1) \cdot \beta_3(j) \right)\\

&= \sum_{j=1}^N \left( a_{1j} \cdot b_j(Y_3=v_4) \cdot \beta_3(j) \right)

\end{align}

$$

This brings us to the recursive form of the backward algorithm:

$$

\begin{align}

\beta_T(i) &= 1 \\

\beta_t(i) &= \sum_{j=1}^N \left( a_{ij} \cdot b_j(Y_{t+1}) \cdot \beta_{t+1}(j) \right)

\end{align}

$$

function [p] = beta(lambda, Y, t, X) T = numel(Y); assert(1 <= t && t <= T, '1 <= t <= T'); % termination condition for the recursive calculation if t == T p = 1; return; end % recursive calculation p = 0; for j = 1:lambda.N p = p + lambda.A(X, j) * lambda.B(j, Y(t+1)) ... * beta(lambda, Y, t+1, j); end end

## Evaluating the Hidden Markov Model

Using either forward or backward probabilities \(\alpha_t(i)\) and \(\beta_t(i)\), the HMM \(\lambda\) can be evaluated given an observation sequence \(Y = \left\{ Y_1, \cdots Y_T\right\}\).

$$

\begin{align}

P(Y) &= \sum_{i=1}^{N} \alpha_T(i) \\

&= \sum_{i=1}^{N} \pi_i \cdot b_i(Y_1) \cdot \beta_1(i)

\end{align}

$$

That is, the sum over all possibilities of generating that sequence while ending in any of the states (first case) or the sum over all possibilities of generating that sequence while starting in any of the states (second case). This probability of the model to generate the given observation sequence can – in the task of evaluating a trained model – be used as a measure of the model’s fitness.

function [p] = P(lambda, Y, forward) T = numel(Y); p = 0; if forward == true % solve using forward probabilities for i=1:lambda.N p = p + alpha(lambda, Y, T, i); end else % solve using backward probabilities for i=1:lambda.N p = p + lambda.pi(i) * beta(lambda, Y, 1, i) ... * lambda.B(i, Y(1)); end end end

In the presence of a whole corpus of observations \(\mathcal{C} = \left\{Y^1, Y^2, \cdots, Y^C\right\}\) the probability measure extends by multiplying the probabilities of each observation sequence \(Y^c \in \mathcal{C}\).

$$

\begin{align}

P(\mathcal{C}) &= \prod_{c=1}^{C} P(Y^c)

\end{align}

$$

The problem with this approach is that this product rapidly goes towards zero, resulting in either underflow or precision loss due to machine limitations. An alternative measure is the log-likelihood \(L(\mathcal{C})\), which uses the (natural) logarithm. Since

$$

\begin{align}

a \cdot b &= e^{\ln a + \ln b}

\end{align}

$$

it follows that

$$

\begin{align}

L(\mathcal{C}) &= \sum_{c=1}^{C} \ln P(Y^c)

\end{align}

$$

where

$$

\begin{align}

P(\mathcal{C}) &= e^{L(\mathcal{C})}

\end{align}

$$

Since the logarithm is a monotonic function, larger probabilities correspond to larger likelihoods and vice versa, which makes the \(L(\mathcal{C})\) function more feasible for the expectation maximization task.

## Calculating the expectation

Using forward and backward probabilities we are now able to calculate the expected number of times that the system is in either state \(s_i \in \mathcal{S}\) or that the system moves from state \(s_i \in \mathcal{S} \to s_j \in \mathcal{S}\).

To do so, we define some additional functions \(\gamma_t(i)\) and \(\xi_t(i, j)\) as

$$

\begin{align}

\gamma_t(i) &= \frac{\alpha_t(i) \cdot \beta_t(i)}{P(Y)} \\

\xi_t(i, j) &= \frac{\alpha_t(i) \cdot a_{ij} \cdot b_j(Y_{t+1})\cdot \beta_{t+1}(j)}{P(Y)} \\

\end{align}

$$

where

$$

\begin{align}

\gamma_t(i) &= \sum_{j=1}^N \xi_t(i, j)

\end{align}

$$

Here \(\gamma_t(i)\) is the probability of the system of being in state \(s_i\) at time \(t\) (given the observation sequence \(Y\)) and \(\xi_t(i, j)\) (that’s Xi) is the probability of being in state \(s_i\) at time \(t\) while being in state \(s_j\) at time \(t+1\), i.e. the transition probability. It’s natural that the sum of all probabilities of leaving a state \(s\) anywhere is equal to the probability of being *in* that state (because, obviously, you have to be in that state in order to leave it).

function [p] = gamma(lambda, Y, t, i) T = numel(Y); assert(1 <= t && t <= T, '1 <= t <= T'); p = alpha(lambda, Y, t, i) * beta(lambda, Y, t, i) / P(lambda, Y); end

function [p] = xi(lambda, Y, t, i, j) T = numel(Y); assert(1 <= t && t < T, '1 <= t < T'); p = alpha(lambda, Y, t, i) ... * lambda.A(i, j) * lambda.B(j, Y(t+1)) ... * beta(lambda, Y, t+1, j) ... / P(lambda, Y); end

## The Baum-Welch equations

Given these numbers, the Baum-Welch equations can be written as follows.

### For the initial probabilities

$$

\begin{align}

\tilde{\pi}_i &= \gamma_1(i)

\end{align}

$$

Whis is simple the probability of being in a given state at time \(t=1\).

### For the transition probabilities

$$

\begin{align}

\tilde{a}_{ij} &= \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}

\end{align}

$$

Here, \(\sum_{t=1}^{T-1} \xi_t(i, j)\) is the probability of moving from state \(s_i\) to \(s_j\), where $\sum_{t=1}^{T-1} \gamma_t(i)$ is the probability of starting in state \(s_i\), which normalizes it.

### And for the emission probabilities

$$

\begin{align}

\tilde{b}_{ik} &= \frac{\sum_{t=1,Y_t = v_k}^{T} \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)} = \frac{\sum_{t=1}^{T} \begin{cases} 0,& \text{if}\, Y_t \neq v_k \\ \gamma_t(i),& \text{if}\, Y_t = v_k \end{cases}}{\sum_{t=1}^{T} \gamma_t(i)}

\end{align}

$$

Here \(\sum_{t=1,Y_t = v_k}^{T} \gamma_t(i)\) is the sum of probabilities of being in state \(s_i\) only where an observation $Y_t = v_k$ was made. Again, \(\sum_{t=1}^{T} \gamma_t(i)\) normalizes it.

### The optimized new HMM

Given these three parameters, we obtain an optimized Hidden Markov Model \(\tilde{\lambda} = \left\{\mathcal{S}; \mathcal{V}; \tilde{\underline{A}}; \tilde{\underline{B}}; \tilde{\vec{\pi}}\right\}\). These last steps are repeated until the difference in the measure \(P(Y)\), \(P(\mathcal{C})\) or \(L(\mathcal{C})\) is smaller than a given threshold (or the number of iteration exceeds a certain count, whichever happens first). And that’s that.