## Hidden Markov Model training using the Baum-Welch Algorithm

The Baum-Welch algorithm determines the (locally) optimal parameters for a Hidden Markov Model by essentially using three equations.

One for the initial probabilities:

\begin{align}
\pi_i &= \frac{E\left(\text{Number of times a sequence started with state}\, s_i\right)}{E\left(\text{Number of times a sequence started with any state}\right)}
\end{align}

Another for the transition probabilities:

\begin{align}
a_{ij} &= \frac{E\left(\text{Number of times the state changed from}\, s_i \, \text{to}\,s_j\right)}{E\left(\text{Number of times the state changed from}\, s_i \, \text{to any state}\right)}
\end{align}

And the last one for the emission probabilities:

\begin{align}
b_{ik} &= \frac{E\left(\text{Number of times the state was}\, s_i \, \text{and the observation was}\,v_k\right)}{E\left(\text{Number of times the state was}\, s_i\right)}
\end{align}

If one had a fully labeled training corpus representing all possible outcomes, this would be exactly the optimal solution: count each occurrence, normalize and you're done. If, however, no such labeled training corpus is available, i.e. only observations are given with no corresponding state sequences, the expected values $E(c)$ of these counts have to be estimated. This can be done (and is done) using the forward and backward probabilities $\alpha_t(i)$ and $\beta_t(i)$, as described below.
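For the fully labeled case, that counting really is all there is. Here is a minimal sketch in Python (the article's own snippets are Octave/MATLAB; the function name `count_mle` and the 0-based index encoding are my own choices for illustration):

```python
def count_mle(state_seqs, obs_seqs, N, M):
    """Maximum-likelihood estimates of pi, A and B from a fully
    labeled corpus: count each event, then normalize per row."""
    pi = [0.0] * N
    A = [[0.0] * N for _ in range(N)]
    B = [[0.0] * M for _ in range(N)]
    for states, obs in zip(state_seqs, obs_seqs):
        pi[states[0]] += 1                  # sequence started with this state
        for s, s_next in zip(states, states[1:]):
            A[s][s_next] += 1               # transition s_i -> s_j
        for s, o in zip(states, obs):
            B[s][o] += 1                    # emission v_k while in s_i
    # divide each count by its row total (the "any state" denominator)
    norm = lambda row: [c / sum(row) for c in row] if sum(row) else row
    return norm(pi), [norm(r) for r in A], [norm(r) for r in B]

pi, A, B = count_mle([[0, 0, 1], [1, 1, 0]], [[0, 1, 1], [1, 0, 0]], 2, 2)
```

With these two toy sequences, one of two starts in each state gives `pi == [0.5, 0.5]`.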

## Hidden Markov Models

Given a Hidden Markov Model $\lambda$ as $\lambda = \left\{\mathcal{S}; \mathcal{V}; \underline{A}; \underline{B}; \vec{\pi}\right\}$ with

- $\mathcal{S} = \left\{s_1; s_2; \cdots ; s_N\right\}$: the alphabet of all $N$ possible states, i.e. values of the random variable $X_t$
- $\mathcal{V} = \left\{v_1; v_2; \cdots ; v_M\right\}$: the alphabet of all $M$ possible observations, i.e. emissions of the random variable $Y_t$
- $\underline{A} \in \mathbb{R}^{N \times N}$: the state transition matrix, with $a_{ij}$ being the probability of moving from state $s_i$ to $s_j$
- $\underline{B} \in \mathbb{R}^{N \times M}$: the observation matrix, with $b_{ij} = b_i(v_j)$ being the probability of emitting the observation $v_j$ when in state $s_i$
- $\vec{\pi} \in \mathbb{R}^{N}$: the initial distribution, with $\pi_i = P(X_1 = s_i)$ being the probability that the first state is $s_i$

An HMM with three states $s_1 \cdots s_3$ ($N = 3$) and four observations $v_1 \cdots v_4$ ($M = 4$)

then, given the graph in the image above and assuming that the probability of staying in a state $s_i$ at time $t$ is simply $P(X_t = s_i | X_{t-1} = s_i) = 1-\sum_{j \neq i}{a_{ij}}$ (otherwise the probabilities of leaving a state such as $s_3$ would never sum to 1, as there is no edge in the graph leaving $s_3$; this is a problem with the image though), i.e.

\begin{align}
a_{11} &= 1-(a_{12}) \\
a_{22} &= 1-(a_{21} + a_{23}) \\
a_{33} &= 1
\end{align}

then the matrices $\underline{A}$ and $\underline{B}$ would look as follows:

\begin{align}
\underline{A} &= \begin{bmatrix}
a_{11} & a_{12} & 0 \\
a_{21} & a_{22} & a_{23} \\
0 & 0 & a_{33}
\end{bmatrix} \\
\underline{B} &= \begin{bmatrix}
b_{11} & b_{12} & b_{13} & b_{14} \\
b_{21} & b_{22} & b_{23} & b_{24} \\
b_{31} & b_{32} & b_{33} & b_{34}
\end{bmatrix}
\end{align}

The $\vec{\pi}$ vector isn't given, so one might assume a uniform distribution, i.e. $\pi_i = \frac{1}{N} \quad \forall s_i \in \mathcal{S}$, such that

\begin{align}
\vec{\pi} &= \begin{bmatrix}
1/3 \\
1/3 \\
1/3
\end{bmatrix}
\end{align}

so that there is an equal probability of starting in any of the possible states.
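To make this concrete, here is a hypothetical numeric instance of such a model in Python (all values invented for illustration; only the zero pattern of $\underline{A}$ follows the graph, and the dictionary layout mirrors the `lambda` struct used in the article's Octave snippets):

```python
# A made-up instance of the model above: N = 3 states, M = 4 observations.
# The zeros in A mirror the missing edges (no s1 -> s3, nothing leaving s3).
lam = {
    "N": 3,
    "M": 4,
    "A": [[0.7, 0.3, 0.0],
          [0.2, 0.5, 0.3],
          [0.0, 0.0, 1.0]],          # a_33 = 1: s3 can only stay in s3
    "B": [[0.5, 0.2, 0.2, 0.1],
          [0.1, 0.4, 0.4, 0.1],
          [0.25, 0.25, 0.25, 0.25]],
    "pi": [1 / 3, 1 / 3, 1 / 3],     # uniform initial distribution
}

# every row of A and B, and pi itself, must sum to 1
assert all(abs(sum(row) - 1) < 1e-12
           for row in lam["A"] + lam["B"] + [lam["pi"]])
```

The row-sum check at the end is worth keeping around: every re-estimation step later on must preserve exactly this property.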

## Initial, transition and emission probabilities

Given the model $\lambda$, here are some probabilities to get us started. But first, some notation:

- $P(X_t = s_i)$: the probability of the state at time $t$ being $s_i$
- $P(X_t = s_j | X_{t-1} = s_i)$: the conditional probability of the state at time $t$ being $s_j$ given that the state at time $t-1$ was $s_i$
- $P(X_t = s_i, Y_{t} = v_j)$: the joint probability of the state at time $t$ being $s_i$ while seeing the observation $v_j$

The probability that the model starts (i.e. at $t=1$) in a given state $s_i$ is simply:

\begin{align}
P(X_1 = s_i) &= \pi_i
\end{align}

Also, the (conditional) probability that the model emits an observation $v_j$ given that it is in a state $s_i$ at time $t$ is:

\begin{align}
P(Y_t = v_j | X_t = s_i) &= b_{ij}
\end{align}

The probability that the model starts in a given state $s_i$ while emitting the observation $v_j$ then is:

\begin{align}
P(X_1 = s_i, Y_1 = v_j) &= P(X_1 = s_i) \cdot P(Y_1 = v_j | X_1 = s_i) \\
&= \pi_i \cdot b_{ij}
\end{align}

This follows directly from the chain rule of probability: the joint probability factors into the probability of the state times the probability of the emission given that state.

By definition, the (conditional) probability of moving from a state $s_i$ at time $t$ to $s_j$ at time $t+1$ is the probability of being in state $s_j$ at time $t$ given the state $s_i$ at time $t-1$, which simply is:

\begin{align}
P(X_{t} = s_j | X_{t-1} = s_i) &= P(X_{t+1} = s_j | X_{t} = s_i) = a_{ij}
\end{align}

Consequently, the conditional probability of moving from a state $s_i$ to $s_j$ while emitting the observation $v_k$ is then:

\begin{align}
P(X_t = s_j, Y_t = v_k| X_{t-1} = s_i) &= P(X_{t} = s_j | X_{t-1} = s_i) \cdot P(Y_t = v_k | X_t = s_j) \\
&= a_{ij} \cdot b_{jk}
\end{align}

Finally, the joint probability of starting in state $s_i$ while emitting the observation $v_k$, then transitioning to state $s_j$ while emitting the observation $v_l$, is

\begin{align}
P(X_1 = s_i, Y_1 = v_k, X_2 = s_j, Y_2 = v_l) &= P(X_1 = s_i, Y_1 = v_k) \cdot P(X_2 = s_j, Y_2 = v_l | X_1 = s_i) \\
&= \pi_i \cdot b_{ik} \cdot a_{ij} \cdot b_{jl}
\end{align}

To make things a bit easier later on, we define $b_i(Y_t)$ as a function of the emission matrix $\underline{B}$ such that

\begin{align}
b_i(Y_t = v_j) &= P(Y_t = v_j | X_t = s_i) = b_{ij}
\end{align}

i.e. if the observation at time $t$ is $Y_t = v_j$, then $b_i(Y_t)$ simply looks up the matrix entry $b_{ij}$.

## The forward probabilities

The forward probability $\alpha_t(i)$ describes the probability of a (partial) observation sequence $Y_1, Y_2, \cdots, Y_t$ having been emitted while ending in state $s_i$ at time $t$, i.e.

\begin{align}
\alpha_t(i) &= P(Y_1, Y_2, \cdots, Y_t, X_t = s_i)
\end{align}

It can be seen intuitively that $\alpha_1(i)$ is nothing but the probability of starting (i.e. at time $t=1$) in state $s_i$ while emitting the specific observation $Y_1 \in \mathcal{V}$:

\begin{align}
\alpha_1(i) &= P(Y_1, X_1 = s_i) = \pi_i \cdot b_{i}(Y_1), \qquad 1 \leq i \leq N
\end{align}

Let’s assume the graph depicted in the image above implicitly contains the probabilities to stay in each node, i.e. $P(X_t = s_i | X_{t-1} = s_i) > 0 \, \forall s_i \in \mathcal{S}$ (which is missing in the image, as there are no such edges).
Let’s also assume that we have seen the observation sequence $Y_{1\cdots2} = \left\{v_3, v_1\right\}$ at times $t=1 \cdots 2$ (i.e. $T=2$) and that we are interested in the probability of being in state $s_1$ after that observation.

Now there are multiple ways to reach state $s_1$ at time $t=2$ (note that there is no edge between $s_3$ and $s_1$):

| $X_1 \to X_2$ | $P(X_2 \mid X_1)$ | $P(X_1, Y_1, X_2, Y_2)$ |
|---------------|-------------------|-------------------------|
| $s_1 \to s_1$ | $a_{11}$ | $\pi_{1} \cdot b_1(v_3) \cdot a_{11} \cdot b_1(v_1)$ |
| $s_2 \to s_1$ | $a_{21}$ | $\pi_{2} \cdot b_2(v_3) \cdot a_{21} \cdot b_1(v_1)$ |
| $s_3 \to s_1$ | $a_{31} = 0$ | $\pi_{3} \cdot b_3(v_3) \cdot a_{31} \cdot b_1(v_1) = 0$ |

Now the total probability of reaching $s_1$ while seeing the emission $v_1$ is the sum over all these path probabilities, which is $\alpha_{t = 2}(i = 1)$:

\begin{align}
\alpha_{2}(1) &= \left( \pi_{1} \cdot b_1(v_3) \cdot a_{11} \cdot b_1(v_1) \right) \\
&+ \left( \pi_{2} \cdot b_2(v_3) \cdot a_{21} \cdot b_1(v_1) \right) \\
&+ \left( \pi_{3} \cdot b_3(v_3) \cdot a_{31} \cdot b_1(v_1) \right) \\
&= \left( \underbrace{\pi_{1} \cdot b_1(v_3)}_{\alpha_1(1)} \cdot a_{11} + \underbrace{\pi_{2} \cdot b_2(v_3)}_{\alpha_1(2)} \cdot a_{21} + \underbrace{\pi_{3} \cdot b_3(v_3)}_{\alpha_1(3)} \cdot a_{31} \right) \cdot b_1(v_1) \\
&= \left( \sum_{i=1}^N \alpha_1(i) \cdot a_{i1} \right) \cdot b_1(v_1)
\end{align}

which brings us to the recursive definition of the alpha function:

\begin{align}
\alpha_1(i) &= \pi_i \cdot b_{i}(Y_1), &\qquad 1 \leq i \leq N \\
\alpha_t(j) &= \left( \sum_{i=1}^N \alpha_{t-1}(i) \cdot a_{ij} \right) \cdot b_j(Y_t), &\qquad 1 \lt t \leq T, 1 \leq j \leq N
\end{align}

Note that for each time step $t$ we are only interested in the probability of the event in the previous time step $t-1$ (and not $t-2$, etc.), which is due to the Markov assumption.

```matlab
function [p] = alpha(lambda, Y, t, X)
    T = numel(Y);
    assert(1 <= t && t <= T, '1 <= t <= T');

    % termination criterion
    if t == 1
        p = lambda.pi(X) * lambda.B(X, Y(1));
        return;
    end

    % recursive calculation
    p = 0;
    for j = 1:lambda.N
        p = p + alpha(lambda, Y, t-1, j) ...
            * lambda.A(j, X) * lambda.B(X, Y(t));
    end
end
```
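The recursion above recomputes the same $\alpha_{t-1}(i)$ values over and over. Filling a $T \times N$ table bottom-up avoids that; here is a sketch in Python rather than the article's Octave (the name `forward_table` and the 0-based observation encoding are my own):

```python
def forward_table(pi, A, B, Y):
    """All forward probabilities alpha_t(i) via dynamic programming.

    pi: initial distribution, A: N x N transitions, B: N x M emissions,
    Y: observation sequence as 0-based column indices into B.
    Returns a T x N table with table[t][i] = alpha_{t+1}(i)."""
    N, T = len(pi), len(Y)
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):
        # alpha_1(i) = pi_i * b_i(Y_1)
        alpha[0][i] = pi[i] * B[i][Y[0]]
    for t in range(1, T):
        for j in range(N):
            # alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(Y_t)
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j]
                              for i in range(N)) * B[j][Y[t]]
    return alpha
```

Each $\alpha_{t-1}(i)$ is computed exactly once, so the cost is $O(T \cdot N^2)$ instead of the exponential blow-up of the naive recursion.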


## The backward probabilities

Now that we know how likely it is to observe an output sequence $Y_{1 \cdots t}$ and end up in state $s_i$ afterwards, we are interested in the probability of observing a future sequence $Y_{t+1 \cdots T}$ given that we actually are in state $s_i$ at time $t$, which is where the $\beta_t(i)$ function comes in. I found that function easier to understand if one thinks of the “future observation” not so much as a prediction step, but rather as the probability that the current model is able to emit a sequence (starting at $t+1$), which is what matters during training. I’ll try to clarify with an example given the above graph.

Given three observations (i.e. emissions) $Y=\left\{v_3, v_1, v_4\right\}$ (i.e. $T=3$) and that we are in state $X_{3} = s_1$ at time $t=T$, what is the probability of emitting any future observation $Y_{4}$?

\begin{align}
\beta_3(1) &= P(Y_{4} = any|X_3 = s_1) \\
&= P(v_1|s_1) + P(v_2|s_1) + P(v_3|s_1) + P(v_4|s_1) \\
&= \sum_{j=1}^M{P(v_j|s_1)} = 1
\end{align}

This works out because for every state there is exactly one emission per time step by definition. Because of this, $P(Y_{T+1} = any)$ is independent of the state and thus

\begin{align}
\beta_T(i) &= P(Y_{T+1} = any|X_T = s_i) \\
&= P(Y_{T+1} = any) \\
&= 1
\end{align}

Now, stepping back to $t = T-1 = 2$ and assuming that we are in state $X_2 = s_1$, the probability of emitting $Y_{3} = v_4$ at $t=3$ is

\begin{align}
\beta_2(1) &= P(Y_{4} = any, Y_{3} = v_4 | X_2 = s_1) \\
&= P(Y_{4} = any, Y_{3} = v_4, X_3 = s_1 | X_2 = s_1) \\
&+ P(Y_{4} = any, Y_{3} = v_4, X_3 = s_2 | X_2 = s_1) \\
&+ P(Y_{4} = any, Y_{3} = v_4, X_3 = s_3 | X_2 = s_1)
\end{align}

Applying the chain rule to each of these terms yields

\begin{align}
\beta_2(1) &= P(Y_4=any|X_3=s_1,X_2=s_1) \cdot P(Y_3=v_4,X_3=s_1|X_2=s_1) \\
&+ P(Y_4=any|X_3=s_2,X_2=s_1) \cdot P(Y_3=v_4,X_3=s_2|X_2=s_1) \\
&+ P(Y_4=any|X_3=s_3,X_2=s_1) \cdot P(Y_3=v_4,X_3=s_3|X_2=s_1)
\end{align}

by applying the Markov property and noting that $t = 4$ lies beyond $T = 3$,

\begin{align}
P(Y_4|X_3=s_i,X_2=s_j) &= P(Y_4|X_3=s_i) \\
&= \beta_3(i) = 1
\end{align}

which yields, by taking into account all the possible following states,

\begin{align}
\beta_2(1) &= \beta_3(1) \cdot P(Y_3=v_4,X_3=s_1|X_2=s_1) \\
&+ \beta_3(2) \cdot P(Y_3=v_4,X_3=s_2|X_2=s_1) \\
&+ \beta_3(3) \cdot P(Y_3=v_4,X_3=s_3|X_2=s_1) \\
&= \sum_{j=1}^N \left( P(Y_3=v_4,X_3=s_j|X_2=s_1) \cdot \beta_3(j) \right)\\
&= \sum_{j=1}^N \left( a_{1j} \cdot b_j(Y_3=v_4) \cdot \beta_3(j) \right)
\end{align}

This brings us to the recursive form of the backward algorithm:

\begin{align}
\beta_T(i) &= 1, &\qquad 1 \leq i \leq N \\
\beta_t(i) &= \sum_{j=1}^N \left( a_{ij} \cdot b_j(Y_{t+1}) \cdot \beta_{t+1}(j) \right), &\qquad 1 \leq t \lt T, 1 \leq i \leq N
\end{align}

```matlab
function [p] = beta(lambda, Y, t, X)
    T = numel(Y);
    assert(1 <= t && t <= T, '1 <= t <= T');

    % termination condition for the recursive calculation
    if t == T
        p = 1;
        return;
    end

    % recursive calculation
    p = 0;
    for j = 1:lambda.N
        p = p + lambda.A(X, j) * lambda.B(j, Y(t+1)) ...
            * beta(lambda, Y, t+1, j);
    end
end
```
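As with the forward probabilities, a table-filling variant avoids the repeated recursive work; a sketch in Python rather than the article's Octave (the name `backward_table` is my own):

```python
def backward_table(A, B, Y):
    """All backward probabilities beta_t(i) via dynamic programming.

    A: N x N transitions, B: N x M emissions, Y: observation sequence
    as 0-based indices. Returns a T x N table, table[t][i] = beta_{t+1}(i)."""
    N, T = len(A), len(Y)
    beta = [[1.0] * N for _ in range(T)]          # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        for i in range(N):
            # beta_t(i) = sum_j a_ij * b_j(Y_{t+1}) * beta_{t+1}(j)
            beta[t][i] = sum(A[i][j] * B[j][Y[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    return beta
```

Summing $\pi_i \cdot b_i(Y_1) \cdot \beta_1(i)$ over $i$ then gives the same $P(Y)$ as summing $\alpha_T(i)$, which is a useful cross-check.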


## Evaluating the Hidden Markov Model

Using either the forward or the backward probabilities $\alpha_t(i)$ and $\beta_t(i)$, the HMM $\lambda$ can be evaluated given an observation sequence $Y = \left\{ Y_1, \cdots, Y_T\right\}$:

\begin{align}
P(Y) &= \sum_{i=1}^{N} \alpha_T(i) \\
&= \sum_{i=1}^{N} \pi_i \cdot b_i(Y_1) \cdot \beta_1(i)
\end{align}

That is, the sum over all possibilities of generating that sequence while ending in any of the states (first case) or the sum over all possibilities of generating that sequence while starting in any of the states (second case). This probability of the model generating the given observation sequence can, in the task of evaluating a trained model, be used as a measure of the model's fitness.

```matlab
function [p] = P(lambda, Y, forward)
    T = numel(Y);
    p = 0;
    if forward == true
        % solve using forward probabilities
        for i = 1:lambda.N
            p = p + alpha(lambda, Y, T, i);
        end
    else
        % solve using backward probabilities
        for i = 1:lambda.N
            p = p + lambda.pi(i) * beta(lambda, Y, 1, i) ...
                * lambda.B(i, Y(1));
        end
    end
end
```


In the presence of a whole corpus of observations $\mathcal{C} = \left\{Y^1, Y^2, \cdots, Y^C\right\}$, the probability measure extends by multiplying the probabilities of each observation sequence $Y^c \in \mathcal{C}$:

\begin{align}
P(\mathcal{C}) &= \prod_{c=1}^{C} P(Y^c)
\end{align}

The problem with this approach is that this product rapidly goes towards zero, resulting in either underflow or precision loss due to machine limitations. An alternative measure is the log-likelihood $L(\mathcal{C})$, which uses the (natural) logarithm. Since

\begin{align}
a \cdot b &= e^{\ln a + \ln b}
\end{align}

it follows that

\begin{align}
L(\mathcal{C}) &= \sum_{c=1}^{C} \ln P(Y^c)
\end{align}

where

\begin{align}
P(\mathcal{C}) &= e^{L(\mathcal{C})}
\end{align}

Since the logarithm is a monotonic function, larger probabilities correspond to larger likelihoods and vice versa, which makes $L(\mathcal{C})$ the more feasible measure for the expectation-maximization task.
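The difference is easy to see numerically. With a few hundred sequences the raw product underflows to zero while the sum of logs stays perfectly usable (a sketch with a made-up per-sequence probability):

```python
import math

# pretend each of 500 training sequences has probability 1e-5 under the model
probs = [1e-5] * 500

product = 1.0
for p in probs:
    product *= p          # 1e-2500 is far below double precision: becomes 0.0

log_likelihood = sum(math.log(p) for p in probs)   # stays finite

print(product)            # 0.0
print(log_likelihood)     # roughly -5756.46
```

Once the product has collapsed to `0.0`, no comparison between two models is possible any more; the log-likelihood keeps the ordering intact.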

## Calculating the expectation

Using the forward and backward probabilities we are now able to calculate the expected number of times the system is in a state $s_i \in \mathcal{S}$, as well as the expected number of times it moves from a state $s_i \in \mathcal{S}$ to $s_j \in \mathcal{S}$.

To do so, we define two additional functions $\gamma_t(i)$ and $\xi_t(i, j)$ as

\begin{align}
\gamma_t(i) &= \frac{\alpha_t(i) \cdot \beta_t(i)}{P(Y)} \\
\xi_t(i, j) &= \frac{\alpha_t(i) \cdot a_{ij} \cdot b_j(Y_{t+1}) \cdot \beta_{t+1}(j)}{P(Y)}
\end{align}

where

\begin{align}
\gamma_t(i) &= \sum_{j=1}^N \xi_t(i, j)
\end{align}

Here $\gamma_t(i)$ is the probability of the system being in state $s_i$ at time $t$ (given the observation sequence $Y$) and $\xi_t(i, j)$ (that's Xi) is the probability of being in state $s_i$ at time $t$ and in state $s_j$ at time $t+1$, i.e. the transition probability. It is natural that the sum of all probabilities of leaving a state $s_i$ to anywhere is equal to the probability of being in that state (because, obviously, you have to be in a state in order to leave it).

```matlab
function [p] = gamma(lambda, Y, t, i)
    T = numel(Y);
    assert(1 <= t && t <= T, '1 <= t <= T');

    % note that P requires the third argument selecting the solution method
    p = alpha(lambda, Y, t, i) * beta(lambda, Y, t, i) ...
        / P(lambda, Y, true);
end

function [p] = xi(lambda, Y, t, i, j)
    T = numel(Y);
    assert(1 <= t && t < T, '1 <= t < T');

    p = alpha(lambda, Y, t, i) ...
        * lambda.A(i, j) * lambda.B(j, Y(t+1)) ...
        * beta(lambda, Y, t+1, j) ...
        / P(lambda, Y, true);
end
```
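The identity $\gamma_t(i) = \sum_j \xi_t(i, j)$ (for $t < T$) makes a convenient sanity check for any implementation. A self-contained numeric check in Python, using a small made-up model (the helper names `forward` and `backward` are mine):

```python
def forward(pi, A, B, Y):
    """T x N table of forward probabilities alpha_t(i)."""
    N = len(pi)
    a = [[pi[i] * B[i][Y[0]] for i in range(N)]]
    for t in range(1, len(Y)):
        a.append([sum(a[t - 1][i] * A[i][j] for i in range(N)) * B[j][Y[t]]
                  for j in range(N)])
    return a

def backward(A, B, Y):
    """T x N table of backward probabilities beta_t(i)."""
    N, T = len(A), len(Y)
    b = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        b[t] = [sum(A[i][j] * B[j][Y[t + 1]] * b[t + 1][j] for j in range(N))
                for i in range(N)]
    return b

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
Y = [0, 1, 0]

al, be = forward(pi, A, B, Y), backward(A, B, Y)
PY = sum(al[-1][i] for i in range(2))

t = 0
for i in range(2):
    gamma = al[t][i] * be[t][i] / PY
    xi_sum = sum(al[t][i] * A[i][j] * B[j][Y[t + 1]] * be[t + 1][j] / PY
                 for j in range(2))
    assert abs(gamma - xi_sum) < 1e-12   # gamma_t(i) == sum_j xi_t(i, j)
```

The same tables also confirm that $\sum_i \gamma_t(i) = 1$ at every time step, since the $\gamma$ values form a distribution over states.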


## The Baum-Welch equations

Given these numbers, the Baum-Welch equations can be written as follows.

### For the initial probabilities

\begin{align}
\tilde{\pi}_i &= \gamma_1(i)
\end{align}

This is simply the probability of being in a given state at time $t=1$.

### For the transition probabilities

\begin{align}
\tilde{a}_{ij} &= \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}
\end{align}

Here, $\sum_{t=1}^{T-1} \xi_t(i, j)$ is the expected number of transitions from state $s_i$ to $s_j$, while $\sum_{t=1}^{T-1} \gamma_t(i)$ is the expected number of times of being in state $s_i$ at all (and thus being able to leave it), which normalizes the estimate.

### And for the emission probabilities

\begin{align}
\tilde{b}_{ik} &= \frac{\sum_{t=1,Y_t = v_k}^{T} \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)} = \frac{\sum_{t=1}^{T} \begin{cases} 0,& \text{if}\, Y_t \neq v_k \\ \gamma_t(i),& \text{if}\, Y_t = v_k \end{cases}}{\sum_{t=1}^{T} \gamma_t(i)}
\end{align}

Here $\sum_{t=1,Y_t = v_k}^{T} \gamma_t(i)$ is the sum of the probabilities of being in state $s_i$, taken only over the times where the observation $Y_t = v_k$ was made. Again, $\sum_{t=1}^{T} \gamma_t(i)$ normalizes it.

### The optimized new HMM

Given these three parameters, we obtain an optimized Hidden Markov Model

$\tilde{\lambda} = \left\{\mathcal{S}; \mathcal{V}; \tilde{\underline{A}}; \tilde{\underline{B}}; \tilde{\vec{\pi}}\right\}.$

These last steps are repeated until the change in the measure $P(Y)$, $P(\mathcal{C})$ or $L(\mathcal{C})$
is smaller than a given threshold (or the number of iterations exceeds a certain count, whichever happens first). And that's that.
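Put together, one full re-estimation loop can be sketched as follows (in Python rather than the article's Octave, for a single observation sequence; the function name `baum_welch` and all conventions are my own, and a production version would work in log space to avoid the underflow discussed above):

```python
def baum_welch(pi, A, B, Y, iters=10):
    """Re-estimate (pi, A, B) on one observation sequence Y
    (0-based indices) using the update equations above."""
    N, M, T = len(pi), len(B[0]), len(Y)
    for _ in range(iters):
        # forward table: alpha_t(j) = (sum_i alpha_{t-1}(i) a_ij) b_j(Y_t)
        al = [[pi[i] * B[i][Y[0]] for i in range(N)]]
        for t in range(1, T):
            al.append([sum(al[t - 1][i] * A[i][j] for i in range(N)) * B[j][Y[t]]
                       for j in range(N)])
        # backward table: beta_T(i) = 1, then the recursion from the text
        be = [[1.0] * N for _ in range(T)]
        for t in range(T - 2, -1, -1):
            be[t] = [sum(A[i][j] * B[j][Y[t + 1]] * be[t + 1][j]
                         for j in range(N)) for i in range(N)]
        PY = sum(al[-1])
        gamma = [[al[t][i] * be[t][i] / PY for i in range(N)] for t in range(T)]
        xi = [[[al[t][i] * A[i][j] * B[j][Y[t + 1]] * be[t + 1][j] / PY
                for j in range(N)] for i in range(N)] for t in range(T - 1)]
        # the three Baum-Welch updates
        pi = gamma[0]
        A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
        B = [[sum(gamma[t][i] for t in range(T) if Y[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(M)] for i in range(N)]
    return pi, A, B
```

Every iteration preserves the stochastic constraints: the returned $\vec{\pi}$ and each row of $\underline{A}$ and $\underline{B}$ still sum to 1, which is an easy invariant to assert while debugging.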