Important Definitions
Bigram Model
A bigram model is a probabilistic model that assumes each element (such as a POS tag) depends only on the immediately preceding element. In POS tagging, it is based on the first-order Markov assumption.
Mathematically: P(ti | t1, t2, …, ti−1) ≈ P(ti | ti−1)
Transition Probability
A transition probability is the probability of one POS tag following another POS tag in a sequence.
Mathematically: P(tj | ti) = C(ti, tj) / C(ti), where C(ti, tj) is the number of times tag tj follows tag ti and C(ti) is the count of tag ti.
It indicates how likely tag tj is to occur after tag ti.
Emission Probability
An emission probability is the probability that a given POS tag generates (emits) a specific word.
Mathematically: P(w | t) = C(t, w) / C(t), where C(t, w) is the number of times word w is tagged with t and C(t) is the count of tag t.
It represents how likely a word w is to be produced by a POS tag t.
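The definitions above can be made concrete with a small sketch. Assuming a toy tagged corpus (the corpus, function names, and variable names below are illustrative, not part of the original material), the transition and emission probabilities are just relative counts:

```python
from collections import Counter

# Toy tagged corpus (illustrative assumption): each sentence is a list of (word, tag) pairs.
corpus = [
    [("the", "DT"), ("dog", "N"), ("runs", "V")],
    [("a", "DT"), ("cat", "N"), ("sleeps", "V")],
]

tag_counts = Counter()       # C(t)
bigram_counts = Counter()    # C(t_prev, t), with a START pseudo-tag
emission_counts = Counter()  # C(t, w)

for sentence in corpus:
    prev = "START"
    for word, tag in sentence:
        tag_counts[tag] += 1
        bigram_counts[(prev, tag)] += 1
        emission_counts[(tag, word)] += 1
        prev = tag

def transition_prob(prev_tag, tag):
    """P(tag | prev_tag) = C(prev_tag, tag) / C(prev_tag)."""
    prev_total = sum(c for (p, _), c in bigram_counts.items() if p == prev_tag)
    return bigram_counts[(prev_tag, tag)] / prev_total if prev_total else 0.0

def emission_prob(word, tag):
    """P(word | tag) = C(tag, word) / C(tag)."""
    return emission_counts[(tag, word)] / tag_counts[tag] if tag_counts[tag] else 0.0

print(transition_prob("DT", "N"))  # 1.0 in this toy corpus
print(emission_prob("dog", "N"))   # 0.5
```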
Tags: Noun (N), Verb (V)
Emission Probabilities:
| Word | P(word|N) | P(word|V) |
|---|---|---|
| eat | 0.05 | 0.60 |
| apple | 0.70 | 0.10 |
Explanation: P(apple|N)=0.70 ≫ P(apple|V)=0.10. With equal priors, **Noun** wins.
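A minimal sketch of this decision, using the emission table above and assuming equal priors for N and V as stated in the explanation (variable names are illustrative):

```python
# Emission probabilities from the table above; equal priors for N and V.
emission = {
    "eat":   {"N": 0.05, "V": 0.60},
    "apple": {"N": 0.70, "V": 0.10},
}
priors = {"N": 0.5, "V": 0.5}

word = "apple"
scores = {tag: priors[tag] * emission[word][tag] for tag in priors}
print(scores)                       # {'N': 0.35, 'V': 0.05}
print(max(scores, key=scores.get))  # N
```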
| From \ To | N | V |
|---|---|---|
| N | 0.30 | 0.70 |
| V | 0.50 | 0.50 |
Explanation: P(N | START)=0.6, P(N→V)=0.70, P(V→N)=0.50 ⇒ 0.6 × 0.70 × 0.50 = 0.21.
Given the transition probability matrix, and as per the bigram model, the probability of the POS sequence N → V → N is calculated by multiplying the bigram probabilities P(V | N) and P(N | V) (a bigram probability has the form P(tagj | tagi)). The following are two valid calculations:
- Without START state: P(N → V → N) = P(V | N) × P(N | V) = 0.70 × 0.50 = 0.35
- With START state: P(N → V → N) = P(N | START) × P(V | N) × P(N | V) = 0.6 × 0.70 × 0.50 = 0.21
In this question, you are given the bigram transition probabilities and the initial (START) probability.
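Both calculations can be reproduced with a few lines of Python, using the transition matrix above and P(N | START) = 0.6 from the worked example (variable names are illustrative):

```python
# Transition probabilities from the matrix above, plus P(N | START) = 0.6.
p_v_given_n = 0.70      # P(V | N)
p_n_given_v = 0.50      # P(N | V)
p_n_given_start = 0.6   # P(N | START)

without_start = p_v_given_n * p_n_given_v                  # 0.35
with_start = p_n_given_start * p_v_given_n * p_n_given_v   # ≈ 0.21
print(without_start, with_start)
```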
| Word | P(word|Verb) | P(word|Noun) |
|---|---|---|
| runs | 0.65 | 0.20 |
| dog | 0.10 | 0.75 |
Explanation: 0.65×0.6 = 0.39 ≫ 0.20×0.4 = 0.08 ⇒ Verb.
The question asks for the tag (Verb or Noun) that is more likely to have generated the word "runs".
How do we decide the tag?
In Hidden Markov Models (HMMs), for a single word, we compare P(tag) × P(word | tag) for each candidate tag.
Step-by-step calculation
1. Probability that runs is a Verb: P(Verb) × P(runs | Verb) = 0.6 × 0.65 = 0.39
2. Probability that runs is a Noun: P(Noun) × P(runs | Noun) = 0.4 × 0.20 = 0.08
3. Compare the values
Since 0.39 > 0.08, Verb is the more probable tag.
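A short sketch of the same comparison, using the priors P(Verb) = 0.6 and P(Noun) = 0.4 and the emission values from the table (names are illustrative):

```python
# Priors and emissions used in the "runs" example.
priors = {"Verb": 0.6, "Noun": 0.4}
emission_runs = {"Verb": 0.65, "Noun": 0.20}  # P(runs | tag)

scores = {tag: priors[tag] * emission_runs[tag] for tag in priors}
print(scores)                       # Verb ≈ 0.39, Noun ≈ 0.08
print(max(scores, key=scores.get))  # Verb
```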
| Word | Noun | Verb |
|---|---|---|
| cats | 0.65 | 0.05 |
| chase | 0.10 | 0.80 |
| mice | 0.70 | 0.05 |
The table can be read as P(cats | Noun) = 0.65, P(cats | Verb) = 0.05, and so on.
Explanation: P(chase|Verb)=0.80 ≫ P(chase|Noun)=0.10.
You have an observation sequence "cats chase mice" and need to determine which part-of-speech tag (Noun or Verb) is most probable for the word "chase".
The table provides emission probabilities: the probability of observing a particular word given a specific POS tag.
For the word "chase", we look at its emission probabilities:
- P(chase | Noun) = 0.10 (probability of observing "chase" if it's a Noun)
- P(chase | Verb) = 0.80 (probability of observing "chase" if it's a Verb)
Since 0.80 > 0.10, the word "chase" is much more likely to be a Verb than a Noun.
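The same emission-only comparison can be applied to every word in the observation sequence. The sketch below ignores transition probabilities and simply takes the most probable tag per word, which is a simplification of full HMM decoding:

```python
# Emission table from above: P(word | tag).
emission = {
    "cats":  {"Noun": 0.65, "Verb": 0.05},
    "chase": {"Noun": 0.10, "Verb": 0.80},
    "mice":  {"Noun": 0.70, "Verb": 0.05},
}

# Pick the most likely tag for each word using emission probabilities alone.
for word in "cats chase mice".split():
    probs = emission[word]
    best = max(probs, key=probs.get)
    print(word, "->", best, probs[best])
# cats -> Noun 0.65, chase -> Verb 0.8, mice -> Noun 0.7
```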
| From \ To | N | V |
|---|---|---|
| N | 0.55 | 0.45 |
| V | 0.30 | 0.70 |
Explanation: 0.5 × 0.55 × 0.45 = 0.12375.
Explanation: Insufficient data.
What is required to answer this type of question?
To find the most probable tag for a given word, even when the previous tag is known, you need:
P(ti | ti−1) and P(wi | ti)
- Transition probability → models tag sequence likelihood. (Example: P(NOUN | ADJ), P(ADJ | ADJ))
- Emission probability → models word–tag compatibility. (Example: P(light | NOUN), P(light | ADJ))
Both are required for a valid probabilistic decision in an HMM, but neither is provided in the question.
Explanation: In an HMM, the best tag for a sentence start is mainly determined by the START → tag transition probability, and Determiners most frequently follow the START state in training data.
At the start of a sentence there is no previous real tag; instead, the HMM uses a special START state. So for the first word, the tag is chosen mainly using P(tag | START). This probability is learned from training data by counting which tags most often begin sentences.
| From \ To | DT | N | V |
|---|---|---|---|
| DT | 0.10 | 0.75 | 0.15 |
| N | 0.05 | 0.60 | 0.35 |
| V | 0.10 | 0.20 | 0.70 |
Explanation: 0.55 × 0.75 × 0.35 ≈ 0.144.
What does “likelihood of the sequence DT → N → V” mean?
It means: What is the probability that the HMM generates this tag sequence?
In an HMM, the likelihood of a tag sequence is computed by multiplying the relevant transition probabilities.
Formula to be used
P(DT → N → V) = P(DT | START) × P(N | DT) × P(V | N)
Substituting values
= 0.55 × 0.75 × 0.35 ≈ 0.144
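A small, generic sketch of this chain multiplication, using the transition matrix above and P(DT | START) = 0.55 from the worked example (the function name and data layout are illustrative assumptions):

```python
# Transition matrix from the table above, plus P(DT | START) = 0.55.
start = {"DT": 0.55}
trans = {
    "DT": {"DT": 0.10, "N": 0.75, "V": 0.15},
    "N":  {"DT": 0.05, "N": 0.60, "V": 0.35},
    "V":  {"DT": 0.10, "N": 0.20, "V": 0.70},
}

def sequence_likelihood(tags):
    """P(t1 | START) * P(t2 | t1) * ... using transition probabilities only."""
    p = start.get(tags[0], 0.0)
    for prev, cur in zip(tags, tags[1:]):
        p *= trans[prev][cur]
    return p

print(sequence_likelihood(["DT", "N", "V"]))  # ≈ 0.144
```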
Explanation: 0.55 × 0.45 = 0.2475 > 0.45 × 0.40 = 0.18 ⇒ Noun.
What does HMM compute?
For each possible tag t, an HMM compares:
P(t | previous tag) × P(word | t)
Step 1: Probability of Noun
P(N) = P(V → N) × P(book | N) = 0.55 × 0.45 = 0.2475
Step 2: Probability of Verb
P(V) = P(V → V) × P(book | V) = 0.45 × 0.40 = 0.18
Since the probability for NOUN (0.2475) is greater than that for VERB (0.18), NOUN is the more probable tag.
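A sketch of this one-step decision, plugging in the transition and emission values quoted in the explanation (variable names are illustrative):

```python
# Values from the "book" example: the previous tag is Verb,
# P(N | V) = 0.55, P(V | V) = 0.45, P(book | N) = 0.45, P(book | V) = 0.40.
trans_from_v = {"N": 0.55, "V": 0.45}   # P(tag | previous tag = V)
emission_book = {"N": 0.45, "V": 0.40}  # P(book | tag)

scores = {tag: trans_from_v[tag] * emission_book[tag] for tag in ("N", "V")}
print(scores)                       # N ≈ 0.2475, V ≈ 0.18
print(max(scores, key=scores.get))  # N (Noun)
```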
Explanation: Laplace smoothing prevents zero probabilities.
What is Laplace smoothing?
Laplace smoothing (add-one smoothing) is used to handle unseen events in probabilistic models.
In POS tagging with HMMs, an unknown word is a word that did not appear in the training data.
Without smoothing:
- P(unknown word | tag) = 0
- This would make the entire sequence probability zero, even if all other probabilities are high.
What does Laplace smoothing do?
Laplace smoothing adds 1 to every word count, so the emission probability becomes P(w | t) = (C(t, w) + 1) / (C(t) + V), where V is the vocabulary size.
As a result:
- Even unseen words receive a small, non-zero probability
- No emission probability becomes zero
Therefore, the correct answer is: Option B
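A minimal sketch of add-one smoothing for emission probabilities; the counts, words, and vocabulary size below are illustrative assumptions, not taken from the question:

```python
from collections import Counter

# Illustrative counts (assumed for this sketch).
emission_counts = Counter({("N", "dog"): 8, ("N", "cat"): 2, ("V", "runs"): 5})
tag_counts = Counter({"N": 10, "V": 5})
vocab_size = 4  # |V|: number of distinct word types in the vocabulary

def smoothed_emission(word, tag):
    """Add-one (Laplace) smoothing:
    P(word | tag) = (C(tag, word) + 1) / (C(tag) + |V|)."""
    return (emission_counts[(tag, word)] + 1) / (tag_counts[tag] + vocab_size)

print(smoothed_emission("dog", "N"))    # (8 + 1) / (10 + 4) ≈ 0.643
print(smoothed_emission("blorp", "N"))  # unseen word: (0 + 1) / (10 + 4) ≈ 0.071, non-zero
```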