Key Concepts Illustrated in the Figure
- Visible states (Observations): the observed outputs of an HMM, such as words in a sentence. In the figure, 'cat', 'purrs', etc. are observations.
- Hidden states: the unobserved underlying states (e.g., the POS tags 'DT', 'N', etc. in the figure) that generate the visible observations.
- Transition probabilities: the likelihood of moving from one hidden state to another. In the figure, these are the arrows from one POS tag to another. Example: P(N -> V), i.e., P(V | N).
- Emission probabilities: the likelihood of a visible observation being generated by a hidden state. In the figure, these are the arrows from POS tags to words. Example: P(cat | N).
- POS tagging using HMM: models tags as hidden states and words as observations to find the most probable tag sequence.
- Evaluation problem: computes the probability of an observation sequence given an HMM.
- Forward algorithm: efficiently solves the evaluation problem using dynamic programming (a minimal sketch follows this list).
- Decoding problem: finds the most probable hidden state sequence for a given observation sequence.
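To make the evaluation problem and the forward algorithm concrete, here is a minimal Python sketch. The tag set, vocabulary, and all probability values below are invented for illustration; only the recurrence itself follows the standard HMM definition.

```python
# Forward algorithm: P(observation sequence | HMM) via dynamic programming.
# All tags, words, and probability values here are illustrative toy numbers.

states = ["DT", "N", "V"]
start_p = {"DT": 0.6, "N": 0.3, "V": 0.1}                  # P(tag at position 1)
trans_p = {"DT": {"DT": 0.1, "N": 0.8, "V": 0.1},          # P(next tag | current tag)
           "N":  {"DT": 0.1, "N": 0.2, "V": 0.7},
           "V":  {"DT": 0.4, "N": 0.4, "V": 0.2}}
emit_p  = {"DT": {"the": 0.9,  "cat": 0.05, "purrs": 0.05},  # P(word | tag)
           "N":  {"the": 0.05, "cat": 0.8,  "purrs": 0.15},
           "V":  {"the": 0.05, "cat": 0.05, "purrs": 0.9}}

def forward(obs):
    # alpha[t][s] = P(o_1..o_t, tag at position t = s)
    alpha = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for t in range(1, len(obs)):
        alpha.append({
            s: sum(alpha[t - 1][prev] * trans_p[prev][s] for prev in states)
               * emit_p[s][obs[t]]
            for s in states
        })
    return sum(alpha[-1][s] for s in states)   # P(o_1..o_T)

print(forward(["the", "cat", "purrs"]))        # likelihood of the sentence
```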
In HMM-based POS tagging, tags are hidden states and words are observed symbols.
Viterbi decoding finds the most probable hidden tag sequence.
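A matching Viterbi decoder is sketched below; it reuses the toy start_p, trans_p, and emit_p tables from the forward-algorithm sketch above, so the numbers are illustrative only.

```python
# Viterbi decoding: the most probable hidden tag sequence for an observation
# sequence, using the same illustrative start_p, trans_p, emit_p tables as above.

def viterbi(obs):
    # best[t][s] = (probability of the best path ending in tag s at position t,
    #               backpointer to the previous tag on that path)
    best = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        column = {}
        for s in states:
            prob, prev = max(
                (best[t - 1][p][0] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            column[s] = (prob, prev)
        best.append(column)

    # Trace back from the best final tag.
    last_tag = max(states, key=lambda s: best[-1][s][0])
    path = [last_tag]
    for t in range(len(obs) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return list(reversed(path))

print(viterbi(["the", "cat", "purrs"]))   # e.g. ['DT', 'N', 'V']
```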
Transition probability models tag-to-tag dependency, that is, the probability of a tag ti given the previous tag ti−1. It is calculated using Maximum Likelihood Estimation (MLE) as follows:
Maximum Likelihood Estimation (MLE)
When the state sequence is known (for example, in POS tagging with labeled training data), the transition probability is estimated using Maximum Likelihood Estimation.
aij = Count(ti → tj) / Count(ti)
Where:
- Count(ti → tj) is the number of times a POS tag ti is immediately followed by a POS tag tj in the training data.
- Count(ti) is the total number of times the tag ti appears in the training data.
This estimation ensures that the transition probabilities for each state sum to 1.
For example, the transition probability P(Noun | Det) will be 6/10 = 0.6 if the tag sequence "Det Noun" occurs 6 times in the training corpus (e.g., as in "The/Det cat/Noun"; such annotated text is called tagged training data) and the tag "Det" appears 10 times overall.
Emission probability is P(word | tag).
It answers the question "Given a particular POS tag, how likely is it that this tag generates (emits) a specific word?"
Emission probability calculation: out of the total number of times a tag (e.g., NOUN) appears in the training data, count how many times it appears as the tag of the given word (e.g., "cat/NOUN"). That is, P(word | tag) = Count(tag, word) / Count(tag).
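As a concrete sketch, the snippet below estimates both transition and emission probabilities by MLE from a tiny invented tagged corpus; the sentences and tags are made up purely for illustration.

```python
# MLE estimation of transition and emission probabilities from tagged data.
# The tiny "corpus" below is invented purely for illustration.
from collections import defaultdict

tagged_corpus = [
    [("the", "DT"), ("cat", "NN"), ("purrs", "VBZ")],
    [("the", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]

tag_count = defaultdict(int)      # Count(t_i)
trans_count = defaultdict(int)    # Count(t_i -> t_j)
emit_count = defaultdict(int)     # Count(t_i emits w)

for sentence in tagged_corpus:
    prev_tag = None
    for word, tag in sentence:
        tag_count[tag] += 1
        emit_count[(tag, word)] += 1
        if prev_tag is not None:
            trans_count[(prev_tag, tag)] += 1
        prev_tag = tag

def transition_p(prev_tag, tag):
    # P(tag | prev_tag) = Count(prev_tag -> tag) / Count(prev_tag)
    return trans_count[(prev_tag, tag)] / tag_count[prev_tag]

def emission_p(word, tag):
    # P(word | tag) = Count(tag, word) / Count(tag)
    return emit_count[(tag, word)] / tag_count[tag]

print(transition_p("DT", "NN"))   # 2/2 = 1.0 in this toy corpus
print(emission_p("cat", "NN"))    # 1/2 = 0.5 in this toy corpus
```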
Baum–Welch (EM) learns transition and emission probabilities without labeled data.
Baum–Welch Method
The Baum–Welch method is an algorithm used to train a Hidden Markov Model (HMM) when the true state (tag) sequence is unknown.
What does the Baum–Welch method do?
It estimates (learns) the transition and emission probabilities of an HMM from unlabeled data.
In Simple Terms
- You are given only observation sequences (e.g., words)
- You do not know the hidden state sequence (e.g., POS tags)
- Baum–Welch automatically learns the model parameters that best explain the data
In short, the Baum–Welch method trains an HMM by estimating transition and emission probabilities from unlabeled observation sequences; it is a special case of the Expectation–Maximization (EM) algorithm.
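For illustration, here is a minimal, unscaled Baum–Welch sketch in Python (NumPy) for a single short observation sequence. It only shows the E-step/M-step structure; a practical trainer would use log-space or scaled computations and handle multiple sequences. The initial parameters and the toy observation sequence are arbitrary assumptions.

```python
# Baum–Welch (EM) for a discrete HMM: an unscaled iteration loop over a single
# short observation sequence. Initial A, B, pi values are arbitrary guesses.
import numpy as np

def baum_welch(obs, n_states, n_symbols, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Random row-stochastic initial parameters.
    A = rng.random((n_states, n_states)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((n_states, n_symbols)); B /= B.sum(axis=1, keepdims=True)
    pi = rng.random(n_states); pi /= pi.sum()
    T = len(obs)

    for _ in range(n_iter):
        # E-step: forward and backward probabilities.
        alpha = np.zeros((T, n_states))
        beta = np.zeros((T, n_states))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

        likelihood = alpha[T - 1].sum()
        gamma = alpha * beta / likelihood            # P(state at t | obs)
        xi = np.zeros((T - 1, n_states, n_states))   # P(state at t, state at t+1 | obs)
        for t in range(T - 1):
            xi[t] = (alpha[t][:, None] * A *
                     (B[:, obs[t + 1]] * beta[t + 1])[None, :]) / likelihood

        # M-step: re-estimate parameters from expected counts.
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.zeros_like(B)
        for k in range(n_symbols):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
    return pi, A, B

# Toy run: vocabulary {0: "the", 1: "cat", 2: "purrs"}, two hidden states.
obs = np.array([0, 1, 2, 0, 1, 2])
pi, A, B = baum_welch(obs, n_states=2, n_symbols=3)
print(np.round(A, 3)); print(np.round(B, 3))
```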
Rows correspond to tags and columns to words.
In an HMM POS tagger, the emission matrix represents:
P(word | tag)
So its dimensions are:
- Rows = number of tags
- Columns = vocabulary size
Given:
- Number of tags = 50
- Vocabulary size = 20,000
Emission matrix size:
50 × 20,000
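A quick sketch of what such an emission matrix looks like in code (the random values are placeholders; only the shape and the row normalization matter here):

```python
# Emission matrix B: one row per tag, one column per vocabulary word; rows sum to 1.
import numpy as np

n_tags, vocab_size = 50, 20_000
B = np.random.random((n_tags, vocab_size))
B /= B.sum(axis=1, keepdims=True)    # each row is P(word | tag) over the vocabulary
print(B.shape)                       # (50, 20000) -> 1,000,000 parameters
```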
Trigram models capture dependency on two previous tags.
Trigram Model
A trigram tag model assumes that the probability of a tag depends on the previous two tags.
P(ti | ti−2, ti−1)
In POS tagging using an HMM:
- Transition probabilities are computed using trigrams of tags
- The model captures more context than unigram or bigram models
Example:
If the previous two tags are DT and NN, the probability of the next tag VB is:
P(VB | DT, NN)
Note: In practice, smoothing and backoff are used because many trigrams are unseen.
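Below is a minimal sketch of a trigram transition estimate with simple linear interpolation down to bigram and unigram estimates; the toy tag sequences and the interpolation weights are illustrative assumptions, not values from the text.

```python
# Trigram transition probability with linear interpolation (a common way to
# handle unseen trigrams). Counts and lambda weights below are illustrative.
from collections import defaultdict

trigram_count = defaultdict(int)   # Count(t_{i-2}, t_{i-1}, t_i)
bigram_count = defaultdict(int)    # Count(t_{i-1}, t_i)
unigram_count = defaultdict(int)   # Count(t_i)
total_tags = 0

tag_sequences = [["DT", "NN", "VBZ"], ["DT", "JJ", "NN", "VBZ"]]   # toy tagged data
for tags in tag_sequences:
    for i, tag in enumerate(tags):
        unigram_count[tag] += 1
        total_tags += 1
        if i >= 1:
            bigram_count[(tags[i - 1], tag)] += 1
        if i >= 2:
            trigram_count[(tags[i - 2], tags[i - 1], tag)] += 1

def trigram_p(t2, t1, t, lambdas=(0.6, 0.3, 0.1)):
    # Interpolate MLE trigram, bigram, and unigram estimates.
    tri = trigram_count[(t2, t1, t)] / bigram_count[(t2, t1)] if bigram_count[(t2, t1)] else 0.0
    bi = bigram_count[(t1, t)] / unigram_count[t1] if unigram_count[t1] else 0.0
    uni = unigram_count[t] / total_tags
    return lambdas[0] * tri + lambdas[1] * bi + lambdas[2] * uni

print(trigram_p("DT", "NN", "VBZ"))    # P(VBZ | DT, NN), smoothed by interpolation
```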
Unseen words lead to zero emission probabilities without smoothing.
Data sparsity in emission probabilities means that many valid word–tag combinations were never seen during training, so their probabilities are zero or unreliable.
Data sparsity may occur due to one or more of the following:
- Natural language has a very large vocabulary.
- Training data is finite.
- New or rare words often appear at test time.
Smoothing assigns non-zero probabilities to unseen events.
Refer to the separate discussion of Laplace smoothing for more information.
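As a small sketch, Laplace (add-one) smoothing of emission probabilities can look like this; the counts and vocabulary size are illustrative assumptions.

```python
# Laplace (add-one) smoothing of emission probabilities: unseen (tag, word)
# pairs get a small non-zero probability. Counts below are illustrative.
emit_count = {("NN", "cat"): 5, ("NN", "dog"): 3}   # Count(tag, word) from training
tag_count = {"NN": 8}                               # Count(tag)
vocab_size = 20_000                                 # size of the word vocabulary

def smoothed_emission_p(word, tag):
    # P(word | tag) = (Count(tag, word) + 1) / (Count(tag) + V)
    return (emit_count.get((tag, word), 0) + 1) / (tag_count[tag] + vocab_size)

print(smoothed_emission_p("cat", "NN"))        # seen word: boosted count
print(smoothed_emission_p("unicorn", "NN"))    # unseen word: small but non-zero
```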
Each token is labeled sequentially → classic sequence labeling.
POS Tagging as a Sequence Labeling Task
POS tagging is a sequence labeling task because the goal is to assign a label (POS tag) to each element in a sequence (words in a sentence) while considering their order and context.
What is Sequence Labeling?
In sequence labeling, we:
- Take an input sequence: w1, w2, …, wn
- Produce an output label sequence: t1, t2, …, tn
Each input item receives one corresponding label, and the labels are not independent of each other.
POS Tagging as Sequence Labeling
Input sequence → words in a sentence
The / cat / sleeps
Output sequence → POS tags
DT / NN / VBZ
Each word must receive exactly one POS tag, and the choice of tag depends on:
- The current word (emission probability)
- The neighboring tags (context / transition probability)
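Putting these together for the example above, a first-order (bigram) HMM scores the tag sequence DT NN VBZ for "The cat sleeps" as:
P(DT) × P(The | DT) × P(NN | DT) × P(cat | NN) × P(VBZ | NN) × P(sleeps | VBZ)
where P(DT) is the probability of DT as the first tag. Viterbi decoding then selects, over all possible tag sequences, the one that maximizes this product.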