CS 312 Lecture 17
Hidden Markov Models

(See L. Rabiner, "A Tutorial on Hidden Markov Models ...", Proc. IEEE, Feb. 1989.)

We're going to spend a few lectures and recitations considering some nontrivial algorithms and data structures that illustrate some more advanced techniques such as dynamic programming and amortized analysis.  Today we will investigate hidden Markov models, which are widely used in applications from speech recognition, to bioinformatics, to data mining, to decoding the convolutional codes used in digital data transmission (such as the CDMA and GSM cell phone standards).

Let's start with a basic example that illustrates the use of hidden Markov models.  Consider a situation in which you are able to observe someone tossing a coin.  Moreover, you know that they actually have two coins: a fair coin with probability .5 of heads and .5 of tails, and an unfair coin with probability .6 of heads and .4 of tails.  At each time step they decide with some probability whether to keep tossing the current coin or to switch coins and toss the other one.  We assume that this decision is made independently at each time step (that is, they are not doing something like counting the number of times a given coin has been tossed before switching).  Although the person sometimes switches coins, you don't know when they do so, because they are good at sleight of hand.  Nonetheless, you do know that they don't switch coins very often, because despite their skill, switching coins undetected is considerably more difficult than continuing to toss the same coin.  From the sequence of heads and tails, which is all you can observe, what can you tell about when they are tossing the fair coin versus the unfair coin?

Clearly the true coin being tossed is not visible to you.  The coin currently in use is referred to as the underlying state of the system being observed, and the fact that it is not directly observable is where the "hidden" in hidden Markov model comes from.  In our example there are two states, which we call fair and unfair, or F and U for short.

Another important property of this example, which may not be as clear at first, is that the state at a given time t depends only on the state at the previous time, t-1.  That is, the history of states does not matter, only the previous state.  This is referred to as the first-order Markov property, because only one previous state matters; in an n-th order model the n previous states matter.  Hence the name Markov model.  For a first-order Markov model we can represent the transitions from time t-1 to time t with a matrix in which the entry i,j is the probability of moving from the "ith state" to the "jth state".  For an n-th order model we would instead need a mapping from n-tuples of previous states to distributions over the next state.

We will limit ourselves to the case of finite models, where the set of hidden states and the set of observable events are both finite (and in practice relatively small as well).

Below is a graphical illustration of a simple two-state model, where the self-transitions have probability .9 and the state changes have probability .1.

  +---+        +---+
  |   |.9    .9|   |
  +-> F <----> U <-+
        .1  .1

However, unlike other finite state models we have considered, this model does not simply emit or recognize sequences of the states F and U.  Instead, the observable symbols or events are the heads and tails of the coin tosses, H and T.  Thus in state F, with probability .5 we see an H and with probability .5 we see a T.  In state U, with probability .6 we see an H and with probability .4 we see a T.  We can think of there being a little table next to each state with these probabilities:

  +---+            +---+
  |   |.9        .9|   |
  +-> F   <---->   U <-+
    H .5  .1  .1   H .6
    T .5           T .4

From an HMM like this we can generate sequences of states and corresponding observable symbols by running the model.  To start, we need to know the probability of starting in either the F or U state.  Let's say they are the same.  So initially we pick F or U with equal probability and emit H or T with the probability specified for that state.  Then with probability .9 we transition to the same state, and with probability .1 to the other state.  Again we emit the observable symbol H or T with the corresponding probability for the new state, and so on.

To be more precise, we can define a hidden Markov model to be the 5-tuple H=(S,V,A,B,N) where

S={s1,...,sn} is a set of hidden states
V={v1,...,vm} is a set of visible observations or events (symbols)
A is an nxn matrix where entry aij corresponds to the probability of transitioning from state si to sj
B={b1(k),...,bn(k)} is a set of m-vectors where bi(k) is the probability of observing vk when in state si
N is an n-vector where N(i) specifies the probability that the system is initially in state si

We can map our two coin example above to this notation, with states s1=F,s2=U, observations v1=H,v2=T, a11=a22=.9, a12=a21=.1, b1=(.5,.5), b2=(.6,.4), N=(.5,.5)
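
As a concrete encoding of this notation, here is a minimal sketch in Python; the array layout (lists of lists indexed from 0) and the variable names are just one possible choice for illustration:

S = ["F", "U"]        # hidden states s1, s2
V = ["H", "T"]        # observable symbols v1, v2
A = [[0.9, 0.1],      # A[i][j] = probability of moving from state S[i] to state S[j]
     [0.1, 0.9]]
B = [[0.5, 0.5],      # B[i][k] = probability of observing V[k] while in state S[i]
     [0.6, 0.4]]
N = [0.5, 0.5]        # N[i]    = probability of starting in state S[i]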

We can use such an HMM to generate a sequence of states Q=q1,...,qT, where each qi is in S, and a corresponding sequence of observations O=o1,...,oT, where each oi is in V.
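
To make "running" the model concrete, here is a small sketch that samples a state sequence and observation sequence of length T, assuming the parameter arrays S, V, A, B, N from the sketch above (the function name run_hmm is just illustrative):

import random

def run_hmm(S, V, A, B, N, T):
    """Sample a state sequence Q and observation sequence O of length T."""
    q = random.choices(range(len(S)), weights=N)[0]          # initial state drawn from N
    Q, O = [], []
    for _ in range(T):
        Q.append(S[q])
        o = random.choices(range(len(V)), weights=B[q])[0]   # emit a symbol from state q
        O.append(V[o])
        q = random.choices(range(len(S)), weights=A[q])[0]   # transition to the next state
    return Q, O

# For example, run_hmm(S, V, A, B, N, 10) might return something like
# (['F','F','F','U','U','U','U','U','U','F'], ['H','T','H','H','H','T','H','H','T','H']).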

A more interesting question, however, is: given a model H and an observation sequence O, what is the "best" corresponding state sequence Q?  This would answer our initial question for the two-coin problem, of detecting when the coins are being switched despite observing only the sequence of heads and tails.

There are a number of other interesting questions, such as how to learn or set parameters for a model that are appropriate to a given set of observed sequences (e.g., a model of a given person - what is their probability of switching coins and what are the probabilities for their coins).  This kind of problem is beyond the scope of this class but is considered in machine learning and artificial intelligence courses. Here we will take the HMM as given and consider the problem of determining a best state sequence Q given an observation sequence O and a model H.

There are several ways of defining a best or optimal state sequence Q given an observation sequence O.  The most common is to find a sequence of states that maximizes the probability P(Q|O,H).  Given an observation sequence of length T, there are n^T possible state sequences, and each such state sequence has some probability given the observation sequence O and the HMM H.  Obviously it won't be efficient to compute the best sequence by exhaustively considering all sequences, because there are exponentially many.  We will use a dynamic programming approach to do this more efficiently once we understand the problem.

Without any observations we can consider the probability of each state sequence given just the HMM.  This is commonly referred to as the prior probability, P(Q|H).   For sequences of length 3 and the example HMM above we have:

FFF .5*.9*.9=.405    N(s1)*a11*a11
FFU .5*.9*.1=.045
FUU .5*.1*.9=.045
FUF .5*.1*.1=.005
UUU .5*.9*.9=.405
UUF .5*.9*.1=.045
UFF .5*.1*.9=.045
UFU .5*.1*.1=.005

For each such state sequence we can compute the probability of some given observation sequence, say HTH, using the observation probabilities for the states.  This is commonly referred to as the likelihood, P(O|Q,H).  The product of the likelihood with the prior probability is, up to a normalizing factor that does not depend on the state sequence, the probability of the state sequence given the model and the observation sequence, so comparing these products tells us which state sequence is best.

For the observation sequence HTH:

FFF .5*.5*.5=.125
FFU .5*.5*.6=.15
FUU .5*.4*.6=.12
FUF .5*.4*.5=.1
UUU .6*.4*.6=.144
UUF .6*.4*.5=.12
UFF .6*.5*.5=.15
UFU .6*.5*.6=.18

Combined (prior times likelihood):

FFF .050625
FFU .00675
FUU .0054
FUF .0005
UUU .05832
UUF .0054
UFF .00675
UFU .0009

Note that we have a strong prior that prefers staying in the same state and a weak likelihood whose observation probabilities do not differ much between the two states, so the prior dominates the result in this example.  Nonetheless, UUU is quite a bit better than FFF.
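
The tables above can be reproduced by brute force: enumerate all n^T state sequences, compute the prior and the likelihood of each, and multiply.  Here is a rough sketch of that computation, again assuming the parameter arrays from the earlier sketch; observations are given as indices into V, so HTH is [0, 1, 0], and the function name is just illustrative:

from itertools import product

def brute_force_best(A, B, N, obs):
    """Score every state sequence against the observation sequence obs (symbol indices)."""
    n, T = len(N), len(obs)
    best_seq, best_p = None, -1.0
    for seq in product(range(n), repeat=T):        # all n^T state sequences
        prior = N[seq[0]]                          # P(Q | H)
        for t in range(1, T):
            prior *= A[seq[t-1]][seq[t]]
        like = 1.0                                 # P(O | Q, H)
        for t in range(T):
            like *= B[seq[t]][obs[t]]
        if prior * like > best_p:
            best_seq, best_p = seq, prior * like
    return best_seq, best_p

# brute_force_best(A, B, N, [0, 1, 0]) returns ((1, 1, 1), 0.05832),
# i.e. UUU with combined probability .05832, matching the table above.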

More formally, we are computing

argmax  P(Q|O,H)   =   argmax  P(Q|H) P(O|Q,H)     using Bayes' rule and the fact that we are maximizing
   Q                      Q

[If you don't know these probability identities, don't worry about it.  Bayes' rule is P(A|B) = (P(B|A)P(A))/P(B).

Fixing H, we have P(Q|O) = (P(O|Q)P(Q))/P(O), but we can ignore the denominator because it is positive, does not depend on Q, and we are maximizing over Q.]

Note that even for a short sequence these overall probabilities are quite small, and for even moderate length sequences underflow occurs.  Thus it is common to instead compute negative log probabilities and minimize the resulting sum, rather than maximizing the product of probabilities.

However, let's stay with the standard definition in terms of probabilities for a moment.  Clearly we do not want to explicitly consider each state sequence Q.  That is not necessary, because of the first-order Markov property: given the probability of being in a particular state at time t-1, we can calculate the probability of each next state at time t without needing to consider the entire sequence of states that led up to the state at t-1.

Consider dt(i), the highest probability (best score) along any single path that accounts for the first t observations and ends in state si at time t.

dt(i) =          max      P(q1,...,qt-1, qt=i, o1,...,ot | H)
               q1,...,qt-1

By induction

dt+1(j) = [max  dt(i) aij] bj(ot+1)
                  i

To compute the state sequence, and not just the probability, we need to keep track of the argument that maximized the previous equation for each t and j.  Let rt(j) denote this.  Then the following dynamic programming algorithm, commonly called the Viterbi algorithm after its inventor, computes both the maximizing state sequence and its probability:

1) Initialization (for 1 <= i <= n)

d1(i) = N(i)bi(o1)
r1(i) = 0

2) Recursion (for 2<=t<=T and 1<=j<=n)

dt(j) =    max       dt-1(i) aij bj(ot)
          1<=i<=n

rt(j) =   argmax   dt-1(i) aij
          1<=i<=n

3) Termination

p* =       max       dT(i)
          1<=i<=n
 

qT* =  argmax     dT(i)
          1<=i<=n

4) Path backtracking, recover state sequence (for t = T-1, ..., 1)

qt* = rt+1(qt+1*)
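
Putting the four steps together, here is a minimal sketch of the Viterbi algorithm in Python, using the same parameter conventions as the earlier sketches (obs is a list of observation indices, and the names are just illustrative):

def viterbi(A, B, N, obs):
    """Return the most likely state sequence (as state indices) and its probability."""
    n, T = len(N), len(obs)
    d = [[0.0] * n for _ in range(T)]   # d[t][i]: best score of a path ending in state i at time t
    r = [[0] * n for _ in range(T)]     # r[t][j]: backpointer (argmax) for state j at time t
    # 1) Initialization
    for i in range(n):
        d[0][i] = N[i] * B[i][obs[0]]
    # 2) Recursion
    for t in range(1, T):
        for j in range(n):
            best_i = max(range(n), key=lambda i: d[t-1][i] * A[i][j])
            r[t][j] = best_i
            d[t][j] = d[t-1][best_i] * A[best_i][j] * B[j][obs[t]]
    # 3) Termination
    q = max(range(n), key=lambda i: d[T-1][i])
    p_star = d[T-1][q]
    # 4) Path backtracking
    path = [q]
    for t in range(T - 1, 0, -1):
        q = r[t][q]
        path.append(q)
    path.reverse()
    return path, p_star

# For the two-coin model and observations HTH ([0, 1, 0]) this returns
# ([1, 1, 1], 0.05832), i.e. UUU, agreeing with the brute-force table above.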

Note that the running time of this algorithm is O(T n^2), as opposed to the brute-force approach of considering all O(n^T) possible sequences, because this method extends partial sequences one time step at a time using the Markov property.

The computation is often organized as a "trellis": computing dt+1(j) from the dt(i) involves considering all possible transitions from each state at time t to each state at time t+1, and maximizing.

Schematically, for two hidden states

+--+     +--+     +--+       
|  |     |  |     |  |
+--+     +--+     +--+
|  |     |  |     |  |
+--+     +--+     +--+
dt(i)   aij*bj   dt+1(j)

where each cell in the dt+1 column is computed by considering every cell in the dt column, weighting it by the corresponding transition probability aij and observation probability bj(ot+1), and taking the maximum [that is, think of a complete bipartite graph between the dt column and the dt+1 column].

As noted above, in practice this is done by minimizing a sum of negative log probabilities rather than maximizing a product of probabilities.
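
As a rough illustration of the log-space version, the scores in the Viterbi sketch above become sums of costs (negative log probabilities), maximization becomes minimization, and the backpointers are unchanged; this assumes either that no parameter is exactly zero or that -log 0 is treated as infinity:

import math

def neglog(p):
    """Cost of a probability: -log p, with zero-probability entries mapped to +infinity."""
    return math.inf if p == 0 else -math.log(p)

# In the recursion, products of probabilities become sums of costs:
#   d[0][i] = neglog(N[i]) + neglog(B[i][obs[0]])
#   d[t][j] = min over i of ( d[t-1][i] + neglog(A[i][j]) ) + neglog(B[j][obs[t]])
# and the termination step takes the minimum rather than the maximum.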