      Hidden Markov Model training using the Baum-Welch Algorithm

      The Baum-Welch algorithm determines the (locally) optimal parameters for a Hidden Markov Model by essentially using three equations.

      One for the initial probabilities:


      \begin{align}
      \pi_i &= \frac{E\left(\text{Number of times a sequence started with state}\, s_i\right)}{E\left(\text{Number of times a sequence started with any state}\right)}
      \end{align}

      Another for the transition probabilities:


      \begin{align}
      a_{ij} &= \frac{E\left(\text{Number of times the state changed from}\, s_i \, \text{to}\,s_j\right)}{E\left(\text{Number of times the state changed from}\, s_i \, \text{to any state}\right)}
      \end{align}

      And the last one for the emission probabilities:


      \begin{align}
      b_{ik} &= \frac{E\left(\text{Number of times the state was}\, s_i \, \text{and the observation was}\,v_k\right)}{E\left(\text{Number of times the state was}\, s_i\right)}
      \end{align}

      If one had a fully labeled training corpus representing all possible outcomes, this would be exactly the optimal solution: count each occurrence, normalize, and you're done. If, however, no such labeled training corpus is available — i.e. only observations are given, but no corresponding state sequences — the expected values \(E(c)\) of these counts have to be estimated. This can be done (and is done) using the forward and backward probabilities \(\alpha_t(i)\) and \(\beta_t(i)\), as described below.

      Hidden Markov Models

      Given a Hidden Markov Model \(\lambda\) as \(\lambda = \left\{\mathcal{S}; \mathcal{V}; \underline{A}; \underline{B}; \vec{\pi}\right\}\) with


      \begin{align}
      \mathcal{S} &= \left\{s_1; s_2; \cdots ; s_N\right\}
      \end{align}
      The alphabet of all \(N\) possible states, i.e. values of the random variable \(X_t\)

      \begin{align}
      \mathcal{V} &= \left\{v_1; v_2; \cdots ; v_M\right\}
      \end{align}
      The alphabet of all \(M\) possible observations, i.e. emissions of the random variable \(Y_t\)

      \begin{align}
      \underline{A} &\in \mathbb{R}^{N \times N}
      \end{align}
      The state transition matrix with \(a_{ij}\) being the probability of moving from state \(s_i\) to \(s_j\)

      \begin{align}
      \underline{B} &\in \mathbb{R}^{N \times M}
      \end{align}
      The observation matrix with \(b_{ij} = b_i(v_j)\) being the probability of emitting the observation \(v_j\) when in state \(s_i\)

      \begin{align}
      \vec{\pi} &\in \mathbb{R}^{N}
      \end{align}
      The initial distribution with \(\pi_i = P(X_1 = s_i)\), i.e. the probability that the first state is \(s_i\)

      An HMM with three states \(s_1 \dots s_3\) and four observations \(v_1 \dots v_4\)

      then, given the graph in the image above and assuming that the probability of staying in a state \(s_i\) at time \(t\) is simply \(P(X_t = s_i | X_{t-1} = s_i) = 1-\sum_{j \neq i}{a_{ij}}\) (otherwise the probabilities of leaving \(s_3\) could never sum to 1, as there is no edge in the graph leaving \(s_3\); this is a shortcoming of the image), i.e.


      \begin{align}
      a_{11} &= 1-(a_{12}) \\
      a_{22} &= 1-(a_{21} + a_{23}) \\
      a_{33} &= 1
      \end{align}

      then the matrices \(\underline{A}\) and \(\underline{B}\) would look as follows:


      \begin{align}
      \underline{A} &= \begin{bmatrix}
      a_{11} & a_{12} & 0 \\
      a_{21} & a_{22} & a_{23} \\
      0 & 0 & a_{33}
      \end{bmatrix} \\
      \underline{B} &= \begin{bmatrix}
      b_{11} & b_{12} & b_{13} & b_{14} \\
      b_{21} & b_{22} & b_{23} & b_{24} \\
      b_{31} & b_{32} & b_{33} & b_{34}
      \end{bmatrix}
      \end{align}

      The \(\vec{\pi}\) vector isn't given, so one might assume a uniform distribution, i.e. \(\pi_i = \frac{1}{N} \quad \forall s_i \in \mathcal{S}\), such that


      \begin{align}
      \vec{\pi} &= \begin{bmatrix}
      1/3 \\
      1/3 \\
      1/3
      \end{bmatrix}
      \end{align}

      so that there is an equal probability of starting in any of the possible states.
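
      For reference in the MATLAB snippets further down, such a model can be collected in a plain struct; the field names lambda.N, lambda.A, lambda.B and lambda.pi are the ones used by the functions below. This is a minimal sketch with made-up numbers that merely respect the zero entries of \(\underline{A}\) and the uniform \(\vec{\pi}\) above; the concrete values are illustrative, not taken from the image.

      % example model for the graph above (probabilities are illustrative only)
      lambda.N  = 3;                     % number of states
      lambda.A  = [0.7  0.3  0.0;        % state transition matrix A
                   0.2  0.5  0.3;
                   0.0  0.0  1.0];
      lambda.B  = [0.10 0.20 0.30 0.40;  % emission matrix B, b_i(v_j)
                   0.25 0.25 0.25 0.25;
                   0.40 0.30 0.20 0.10];
      lambda.pi = [1/3; 1/3; 1/3];       % initial distribution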

      Initial, transition and emission probabilities

      Given the model \(\lambda\), here are some probabilities to get us started. But first, some notation:

      \(P(X_t = s_i)\)
      The probability of the state at time \(t\) being \(s_i\)
      \(P(X_t = s_j | X_{t-1} = s_i)\)
      The conditional probability of the state at time \(t\) being \(s_j\), given that the state at time \(t-1\) was \(s_i\)
      \(P(X_t = s_i, Y_{t} = v_j)\)
      The joint probability of the state at time \(t\) being \(s_i\) while seeing the observation \(v_j\)

      The probability that the model starts (i.e. at \(t=1\)) in a given state \(s_i\) is simply:


      \begin{align}
      P(X_1 = s_i) &= \pi_i
      \end{align}

      Also, the (conditional) probability that the model emits an observation \(v_j\) given that it is in state \(s_i\) at time \(t\) is:


      \begin{align}
      P(Y_t = v_j | X_t = s_i) &= b_{ij}
      \end{align}

      The probability that the model starts in a given state \(s_i\) while emitting the observation \(v_j\) then is:


      \begin{align}
      P(X_1 = s_i, Y_1 = v_j) &= P(X_1 = s_i) \cdot P(Y_1 = v_j | X_1 = s_i) \\
      &= \pi_i \cdot b_{ij}
      \end{align}

      This is simply the product rule: the joint probability factors into the probability of starting in \(s_i\) and the conditional probability of emitting \(v_j\) given that state.

      By definition, the transition probabilities do not depend on the time step: the (conditional) probability of moving from a state \(s_i\) at time \(t\) to \(s_j\) at time \(t+1\) is the same as the probability of being in state \(s_j\) at time \(t\) given the state \(s_i\) at time \(t-1\), which simply is:


      \begin{align}
      P(X_{t} = s_j | X_{t-1} = s_i) &= P(X_{t+1} = s_j | X_{t} = s_i) = a_{ij}
      \end{align}

      Consequently, the conditional probability of moving from a state \(s_i\) to \(s_j\) while emitting the observation \(v_k\) is:


      \begin{align}
      P(X_t = s_j, Y_t = v_k| X_{t-1} = s_i) &= P(X_{t} = s_j | X_{t-1} = s_i) \cdot P(Y_t = v_k | X_t = s_j) \\
      &= a_{ij} \cdot b_{jk}
      \end{align}

      Finally, the joint probability of starting in state \(s_i\) while emitting the observation \(v_k\), then transitioning to state \(s_j\) while emitting the observation \(v_l\), is


      \begin{align}
      P(X_1 = s_i, Y_1 = v_k, X_2 = s_j, Y_2 = v_l) &= P(X_1 = s_i, Y_1 = v_k) \cdot P(X_2 = s_j, Y_2 = v_l | X_1 = s_i) \\
      &= \pi_i \cdot b_{ik} \cdot a_{ij} \cdot b_{jl}
      \end{align}

      To make things a bit easier later on, we define \(b_i(Y_t)\) as a lookup into the emission matrix \(\underline{B}\) such that


      \begin{align}
      b_i(Y_t = v_j) &= P(Y_t = v_j | X_t = s_i) = b_{ij}
      \end{align}

      i.e. if \(Y_t = v_j\), then \(b_i(Y_t) = b_{ij}\).

      The forward probabilities

      The forward probability \(\alpha_t(i)\) describes the probability of a (partial) observation sequence \(Y_1, Y_2, \cdots, Y_t\) having been emitted while ending in state \(s_i\) at time \(t\), i.e.


      \begin{align}
      \alpha_t(i) &= P(Y_1, Y_2, \cdots, Y_t, X_t = s_i)
      \end{align}

      It can be seen intuitively that \(\alpha_1(i)\) is nothing but the probability of starting (i.e. at time \(t=1\)) in state \(s_i\) while emitting the specific observation \(Y_1 \in \mathcal{V}\):


      \begin{align}
      \alpha_1(i) &= P(Y_1, X_1 = s_i) = \pi_i \cdot b_{i}(Y_1), \qquad 1 \leq i \leq N
      \end{align}

      Let's assume the graph depicted in the image above implicitly contains the probabilities of staying in each node, i.e. \(P(X_t = s_i | X_{t-1} = s_i) > 0 \; \forall s_i \in \mathcal{S}\) (which is missing in the image, as there are no such edges).
      Let's also assume that we have seen the observation sequence \(Y_{1\cdots2} = \left\{v_3, v_1\right\}\) at times \(t=1\cdots 2\) (i.e. \(T=2\)) and that we are interested in the probability of being in state \(s_1\) after that observation.

      Now there are multiple ways to reach state \(s_1\) at time \(t=2\) (note that there is no edge between \(s_3\) and \(s_1\)):

      \(X_1 \to X_2\)    \(P(X_2|X_1)\)       \(P(X_2, Y_2 | X_1, Y_1)\)
      \(s_1 \to s_1\)    \(a_{11}\)           \(\pi_{1} \cdot b_1(v_3) \cdot a_{11} \cdot b_1(v_1)\)
      \(s_2 \to s_1\)    \(a_{21}\)           \(\pi_{2} \cdot b_2(v_3) \cdot a_{21} \cdot b_1(v_1)\)
      \(s_3 \to s_1\)    \(a_{31} = 0\)       \(\pi_{3} \cdot b_3(v_3) \cdot a_{31} \cdot b_1(v_1) = 0\)

      Now the total probability of reaching \(s_1\) while seeing the emission \(v_1\) is the sum over all these path probabilities, which is \(\alpha_{t = 2}(i = 1)\):


      \begin{align}
      \alpha_{2}(1) &= \left( \pi_{1} \cdot b_1(v_3) \cdot a_{11} \cdot b_1(v_1) \right) \\
      &+ \left( \pi_{2} \cdot b_2(v_3) \cdot a_{21} \cdot b_1(v_1) \right) \\
      &+ \left( \pi_{3} \cdot b_3(v_3) \cdot a_{31} \cdot b_1(v_1) \right) \\
      &= \left( \underbrace{\pi_{1} \cdot b_1(v_3)}_{\alpha_1(1)} \cdot a_{11} + \underbrace{\pi_{2} \cdot b_2(v_3)}_{\alpha_1(2)} \cdot a_{21} + \underbrace{\pi_{3} \cdot b_3(v_3)}_{\alpha_1(3)} \cdot a_{31} \right) \cdot b_1(v_1) \\
      &= \left( \sum_{i=1}^N \alpha_1(i) \cdot a_{i1} \right) \cdot b_1(v_1)
      \end{align}

      which brings us to the recursive definition of the alpha function:


      \begin{align}
      \alpha_1(i) &= \pi_i \cdot b_{i}(Y_1), &\qquad 1 \leq i \leq N \\
      \alpha_t(j) &= \left( \sum_{i=1}^N \alpha_{t-1}(i) \cdot a_{ij} \right) \cdot b_j(Y_t), &\qquad 1 \lt t \leq T, 1 \leq j \leq N
      \end{align}

      Note that for each time step \(t\) we are only interested in the probabilities of the previous time step \(t-1\) (and not \(t-2\), etc.), which is due to the Markov assumption.

      function [p] = alpha(lambda, Y, t, X)
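          % Forward probability alpha_t(X) = P(Y_1, ..., Y_t, X_t = s_X),
          % evaluated recursively as derived above.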
          T = numel(Y);
          assert(1 <= t && t <= T, '1 <= t <= T');
          
          % termination criterion
          if t == 1
              p = lambda.pi(X) * lambda.B(X, Y(1));
              return;
          end
          
          % recursive calculation
          p = 0;
          for j = 1:lambda.N
              p = p + alpha(lambda, Y, t-1, j) ...
                      * lambda.A(j, X) * lambda.B(X, Y(t));
          end
      end
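
      As a usage sketch for the example above (assuming the hypothetical lambda struct from the model section, with observations given as column indices into \(\underline{B}\)); note that this recursive implementation recomputes lower-order values, so for longer sequences a trellis (dynamic programming) variant would be preferable.

      Y   = [3 1];                   % the observed sequence v_3, v_1 (T = 2)
      a21 = alpha(lambda, Y, 2, 1);  % alpha_2(1), the value derived above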
      

      The backward probabilities

      Now that we know how likely it is to observe an output sequence \(Y_{1 \cdots t}\) and end up in state \(s_i\) afterwards, we are interested in the probability of observing a future sequence \(Y_{t+1 \cdots T}\) given that we actually are in state \(s_i\) at time \(t\), which is where the \(\beta_t(i)\) function comes in. I found this function easier to understand if one thinks of the "future observation" not so much as a prediction step, but rather as the probability that the current model is able to emit a sequence (starting at \(t+1\)), which is what matters during training. I'll try to clarify with an example based on the graph above.

      Given three observations (i.e. emissions) \(Y=\left\{v_3, v_1, v_4\right\}\) (i.e. \(T=3\)) and that we are in state \(X_3 = s_1\) at time \(t=T\), what is the probability of emitting any future observation \(Y_{4}\)?


      \begin{align}
      \beta_3(1) &= P(Y_{4} = \text{any} | X_3 = s_1) \\
      &= P(v_1|s_1) + P(v_2|s_1) + P(v_3|s_1) + P(v_4|s_1) \\
      &= \sum_{j=1}^M{P(v_j|s_1)} = 1
      \end{align}

      This works out because, by definition, every state emits exactly one observation per time step, so the emission probabilities of any state sum to one. Because of this, \(P(Y_{T+1} = \text{any})\) is independent of the state and thus


      \begin{align}
      \beta_T(i) &= P(Y_{T+1} = \text{any} | X_T = s_i) \\
      &= P(Y_{T+1} = \text{any}) \\
      &= 1
      \end{align}

      Now stepping back to \(t = T-1 = 2\): assuming that we are in state \(X_2 = s_1\), the probability of emitting \(Y_{3} = v_4\) at \(t=3\) is


      \begin{align}
      \beta_2(1) &= P(Y_{4} = \text{any}, Y_{3} = v_4 | X_2 = s_1) \\
      &= P(Y_{4} = \text{any}, Y_{3} = v_4, X_3 = s_1 | X_2 = s_1) \\
      &+ P(Y_{4} = \text{any}, Y_{3} = v_4, X_3 = s_2 | X_2 = s_1) \\
      &+ P(Y_{4} = \text{any}, Y_{3} = v_4, X_3 = s_3 | X_2 = s_1)
      \end{align}

      which leads to


      \begin{align}
      \beta_2(1) &= P(Y_4=\text{any}|X_3=s_1,X_2=s_1) \cdot P(Y_3=v_4,X_3=s_1|X_2=s_1) \\
      &+ P(Y_4=\text{any}|X_3=s_2,X_2=s_1) \cdot P(Y_3=v_4,X_3=s_2|X_2=s_1) \\
      &+ P(Y_4=\text{any}|X_3=s_3,X_2=s_1) \cdot P(Y_3=v_4,X_3=s_3|X_2=s_1)
      \end{align}

      by applying the Markov principle and noting that \(t = 3 = T\),


      \begin{align}
      P(Y_4|X_3=s_i,X_2=s_j) &= P(Y_4|X_3=s_i) \\
      &= \beta_3(i) = 1
      \end{align}

      which yields, by taking into account all the possible following states,


      \begin{align}
      \beta_2(1) &= \beta_3(1) \cdot P(Y_3=v_4,X_3=s_1|X_2=s_1) \\
      &+ \beta_3(2) \cdot P(Y_3=v_4,X_3=s_2|X_2=s_1) \\
      &+ \beta_3(3) \cdot P(Y_3=v_4,X_3=s_3|X_2=s_1) \\
      &= \sum_{j=1}^N \left( P(Y_3=v_4,X_3=s_j|X_2=s_1) \cdot \beta_3(j) \right)\\
      &= \sum_{j=1}^N \left( a_{1j} \cdot b_j(Y_3=v_4) \cdot \beta_3(j) \right)
      \end{align}

      This brings us to the recursive form of the backward algorithm:


      \begin{align}
      \beta_T(i) &= 1, &\qquad 1 \leq i \leq N \\
      \beta_t(i) &= \sum_{j=1}^N \left( a_{ij} \cdot b_j(Y_{t+1}) \cdot \beta_{t+1}(j) \right), &\qquad 1 \leq t \lt T, 1 \leq i \leq N
      \end{align}

      function [p] =  beta(lambda, Y, t, X)
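          % Backward probability beta_t(X) = P(Y_{t+1}, ..., Y_T | X_t = s_X),
          % evaluated recursively as derived above.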
          T = numel(Y);
          assert(1 <= t && t <= T, '1 <= t <= T');
          
          % termination condition for the recursive calculation
          if t == T
              p = 1;
              return;
          end
          
          % recursive calculation
          p = 0;
          for j = 1:lambda.N           
              p = p + lambda.A(X, j) * lambda.B(j, Y(t+1)) ...
                    * beta(lambda, Y, t+1, j);
          end
      end
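
      And a quick check against the example above (again a hypothetical sketch, assuming the lambda struct from the model section):

      Y  = [3 1 4];                  % v_3, v_1, v_4, i.e. T = 3
      bT = beta(lambda, Y, 3, 1);    % beta_3(1) = 1 by definition
      b2 = beta(lambda, Y, 2, 1);    % beta_2(1) as derived above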
      

      Evaluating the Hidden Markov Model

      Using either the forward or the backward probabilities \(\alpha_t(i)\) and \(\beta_t(i)\), the HMM \(\lambda\) can be evaluated given an observation sequence \(Y = \left\{ Y_1, \cdots, Y_T\right\}\).


      \begin{align}
      P(Y) &= \sum_{i=1}^{N} \alpha_T(i) \\
      &= \sum_{i=1}^{N} \pi_i \cdot b_i(Y_1) \cdot \beta_1(i)
      \end{align}

      That is, either the sum over all possibilities of generating that sequence while ending in any of the states (first case) or the sum over all possibilities of generating that sequence while starting in any of the states (second case). This probability of the model generating the given observation sequence can be used as a measure of the model's fitness when evaluating a trained model.

      function [p] =  P(lambda, Y, forward)
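          % Probability P(Y) of the model lambda generating the observation
          % sequence Y, computed from either the forward or the backward
          % probabilities depending on the 'forward' flag.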
          T = numel(Y);
          p = 0;
          if forward == true
              % solve using forward probabilities
              for i=1:lambda.N
                  p = p + alpha(lambda, Y, T, i);        
              end
          else
              % solve using backward probabilities
              for i=1:lambda.N
                  p = p + lambda.pi(i) * beta(lambda, Y, 1, i) ...
                        * lambda.B(i, Y(1));
              end        
          end
      end
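
      Both variants give the same result; for the two-observation example from the forward section (again a hypothetical check using the sketched lambda struct):

      Y = [3 1];                           % v_3, v_1
      p_fwd = P(lambda, Y, true);          % via forward probabilities
      p_bwd = P(lambda, Y, false);         % via backward probabilities
      assert(abs(p_fwd - p_bwd) < 1e-12);  % both evaluate P(Y)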
      

      In the presence of a whole corpus of observations \(\mathcal{C} = \left\{Y^1, Y^2, \cdots, Y^C\right\}\) the probability measure extends by multiplying the probabilities of each observation sequence \(Y^c \in \mathcal{C}\).


      \begin{align}
      P(\mathcal{C}) &= \prod_{c=1}^{C} P(Y^c)
      \end{align}

      The problem with this approach is that the product rapidly goes towards zero, resulting in either underflow or precision loss due to machine limitations. An alternative measure is the log-likelihood \(L(\mathcal{C})\), which uses the (natural) logarithm. Since


      \begin{align}
      a \cdot b &= e^{\ln a + \ln b}
      \end{align}

      it follows that


      \begin{align}
      L(\mathcal{C}) &= \sum_{c=1}^{C} \ln P(Y^c)
      \end{align}

      where


      \begin{align}
      P(\mathcal{C}) &= e^{L(\mathcal{C})}
      \end{align}

      Since the logarithm is a monotonic function, larger probabilities correspond to larger log-likelihoods and vice versa, which makes \(L(\mathcal{C})\) the more practical measure for the expectation maximization task.
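
      In the style of the functions above, this measure could be written as follows; the function name and the representation of the corpus as a cell array of observation sequences are illustrative choices, not from the original code.

      function [l] = logL(lambda, corpus)
          % Log-likelihood L(C) of a corpus, i.e. the sum of the
          % log-probabilities of its observation sequences.
          l = 0;
          for c = 1:numel(corpus)
              l = l + log(P(lambda, corpus{c}, true));
          end
      end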

      Calculating the expectation

      Using the forward and backward probabilities we are now able to calculate the expected number of times that the system is in a state \(s_i \in \mathcal{S}\), or that it moves from a state \(s_i \in \mathcal{S}\) to a state \(s_j \in \mathcal{S}\).

      To do so, we define some additional functions \(\gamma_t(i)\) and \(\xi_t(i, j)\) as


      \begin{align}
      \gamma_t(i) &= \frac{\alpha_t(i) \cdot \beta_t(i)}{P(Y)} \\
      \xi_t(i, j) &= \frac{\alpha_t(i) \cdot a_{ij} \cdot b_j(Y_{t+1})\cdot \beta_{t+1}(j)}{P(Y)} \\
      \end{align}

      where


      \begin{align}
      \gamma_t(i) &= \sum_{j=1}^N \xi_t(i, j)
      \end{align}

      Here \(\gamma_t(i)\) is the probability of the system being in state \(s_i\) at time \(t\) (given the observation sequence \(Y\)), and \(\xi_t(i, j)\) (that's Xi) is the probability of being in state \(s_i\) at time \(t\) and in state \(s_j\) at time \(t+1\), i.e. the transition probability. It is natural that the sum of the probabilities of leaving a state \(s_i\) for any state equals the probability of being in that state in the first place (because, obviously, you have to be in that state in order to leave it).

      function [p] =  gamma(lambda, Y, t, i)
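          % gamma_t(i): probability of being in state s_i at time t,
          % given the observation sequence Y.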
          T = numel(Y);
          assert(1 <= t && t <= T, '1 <= t <= T');
          
          p = alpha(lambda, Y, t, i) * beta(lambda, Y, t, i) / P(lambda, Y, true);
      end
      
      function [p] =  xi(lambda, Y, t, i, j)
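          % xi_t(i, j): probability of being in state s_i at time t and
          % in state s_j at time t+1, given the observation sequence Y.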
          T = numel(Y);
          assert(1 <= t && t < T, '1 <= t < T');
          
          p = alpha(lambda, Y, t, i) ...
            * lambda.A(i, j) * lambda.B(j, Y(t+1)) ...
            * beta(lambda, Y, t+1, j) ...
            / P(lambda, Y, true);
      end
      

      The Baum-Welch equations

      Given these numbers, the Baum-Welch equations can be written as follows.

      For the initial probabilities


      \begin{align}
      \tilde{\pi}_i &= \gamma_1(i)
      \end{align}

      This is simply the probability of being in a given state at time \(t=1\).

      For the transition probabilities


      \begin{align}
      \tilde{a}_{ij} &= \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}
      \end{align}

      Here, \(\sum_{t=1}^{T-1} \xi_t(i, j)\) is the expected number of transitions from state \(s_i\) to \(s_j\), while \(\sum_{t=1}^{T-1} \gamma_t(i)\) is the expected number of times the system is in state \(s_i\) at all (and thus leaves it for some state), which normalizes the estimate.

      And for the emission probabilities


      \begin{align}
      \tilde{b}_{ik} &= \frac{\sum_{t=1,Y_t = v_k}^{T} \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)} = \frac{\sum_{t=1}^{T} \begin{cases} 0,& \text{if}\, Y_t \neq v_k \\ \gamma_t(i),& \text{if}\, Y_t = v_k \end{cases}}{\sum_{t=1}^{T} \gamma_t(i)}
      \end{align}

      Here \(\sum_{t=1,Y_t = v_k}^{T} \gamma_t(i)\) sums the probabilities of being in state \(s_i\) only over those times where the observation \(Y_t = v_k\) was made. Again, \(\sum_{t=1}^{T} \gamma_t(i)\) normalizes it.
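
      Put into code, one re-estimation step over a single observation sequence might look like the following sketch; the function name baumwelch_update is illustrative, and a corpus version would accumulate the numerators and denominators over all sequences before dividing.

      function [lambda_new] = baumwelch_update(lambda, Y)
          % One Baum-Welch re-estimation step for a single sequence Y,
          % directly following the three equations above.
          T = numel(Y);
          N = lambda.N;
          M = size(lambda.B, 2);
          lambda_new = lambda;

          % initial probabilities: pi_i = gamma_1(i)
          for i = 1:N
              lambda_new.pi(i) = gamma(lambda, Y, 1, i);
          end

          % transition probabilities
          for i = 1:N
              den = 0;
              for t = 1:T-1
                  den = den + gamma(lambda, Y, t, i);
              end
              for j = 1:N
                  num = 0;
                  for t = 1:T-1
                      num = num + xi(lambda, Y, t, i, j);
                  end
                  lambda_new.A(i, j) = num / den;
              end
          end

          % emission probabilities
          for i = 1:N
              den = 0;
              for t = 1:T
                  den = den + gamma(lambda, Y, t, i);
              end
              for k = 1:M
                  num = 0;
                  for t = 1:T
                      if Y(t) == k
                          num = num + gamma(lambda, Y, t, i);
                      end
                  end
                  lambda_new.B(i, k) = num / den;
              end
          end
      end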

      The optimized new HMM

      Given these three parameters, we obtain an optimized Hidden Markov Model

      \(
      \tilde{\lambda} = \left\{\mathcal{S}; \mathcal{V}; \tilde{\underline{A}}; \tilde{\underline{B}}; \tilde{\vec{\pi}}\right\}.
      \)

      These last steps are repeated until the change in the chosen measure \(P(Y)\), \(P(\mathcal{C})\) or \(L(\mathcal{C})\)
      is smaller than a given threshold (or the number of iterations exceeds a certain count, whichever happens first). And that's that.
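
      Put together, the whole training loop might look like the sketch below; the names train_hmm and baumwelch_update as well as the stopping parameters are illustrative, and the log-likelihood of a single sequence is used as the convergence measure.

      function [lambda] = train_hmm(lambda, Y, max_iter, epsilon)
          % Repeat the Baum-Welch re-estimation until the log-likelihood
          % improves by less than epsilon or max_iter is reached.
          L_old = -Inf;
          for it = 1:max_iter
              lambda = baumwelch_update(lambda, Y);
              L_new  = log(P(lambda, Y, true));
              if abs(L_new - L_old) < epsilon
                  break;
              end
              L_old = L_new;
          end
      end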

      September 1st, 2014, by Markus
