Important Definitions

Bigram Model

A bigram model is a probabilistic model that assumes each element (such as a POS tag) depends only on the immediately preceding element. In POS tagging, it is based on the first-order Markov assumption.

Mathematically:

P(ti | t1, …, ti−1) ≈ P(ti | ti−1)

Transition Probability

A transition probability is the probability of one POS tag following another POS tag in a sequence.

Mathematically:

P(tj | ti)

It indicates how likely tag tj is to occur after tag ti.

Emission Probability

An emission probability is the probability that a given POS tag generates (emits) a specific word.

Mathematically:

P(w | t)

It represents how likely word w is to be produced by POS tag t.
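
To make these definitions concrete, here is a minimal Python sketch that estimates transition and emission probabilities by counting over a tiny, made-up tagged corpus. The sentences, tag names, and helper functions below are illustrative assumptions, not part of any specific tagger.

```python
from collections import Counter

# A tiny hand-tagged corpus (made-up data, for illustration only).
tagged_sentences = [
    [("the", "DT"), ("dog", "N"), ("runs", "V")],
    [("a", "DT"), ("cat", "N"), ("sleeps", "V")],
]

tag_counts = Counter()          # count(t)
transition_counts = Counter()   # count(t_i, t_j): tag t_j follows tag t_i
emission_counts = Counter()     # count(t, w): tag t emits word w
prev_tag_counts = Counter()     # count of t_i appearing as a previous tag

for sentence in tagged_sentences:
    previous = "START"
    for word, tag in sentence:
        tag_counts[tag] += 1
        transition_counts[(previous, tag)] += 1
        prev_tag_counts[previous] += 1
        emission_counts[(tag, word)] += 1
        previous = tag

def transition_prob(prev_tag, tag):
    # P(t_j | t_i) = count(t_i, t_j) / count(t_i as previous tag)
    return transition_counts[(prev_tag, tag)] / prev_tag_counts[prev_tag]

def emission_prob(word, tag):
    # P(w | t) = count(t, w) / count(t)
    return emission_counts[(tag, word)] / tag_counts[tag]

print(transition_prob("DT", "N"))  # 1.0 in this toy corpus
print(emission_prob("dog", "N"))   # 0.5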

11. Using the POS-tag HMM below, what is the most likely tag for the word "apple"?

Tags: Noun (N), Verb (V)
Emission Probabilities:
Word     P(word | N)   P(word | V)
eat      0.05          0.60
apple    0.70          0.10
Assume equal prior tag probability.

Answer: A
Explanation: P(apple|N)=0.70 ≫ P(apple|V)=0.10. With equal priors, **Noun** wins.
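
As a quick check, the comparison can be scripted. The sketch below simply restates the emission values from the table; with equal priors the prior term cancels, so only the emissions need to be compared.

```python
# Emission probabilities for "apple", taken from the table above.
emission_apple = {"N": 0.70, "V": 0.10}

# With equal priors, the prior term is the same for both tags,
# so the comparison reduces to the emission probabilities alone.
print(max(emission_apple, key=emission_apple.get))  # 'N'
```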
12. In the following transition matrix, what is the probability of the POS sequence N → V → N?
From \ To    N      V
N            0.30   0.70
V            0.50   0.50
Initial probability P(N)=0.6

Answer: C
Explanation: P(N→V)=0.70, P(V→N)=0.50 ⇒ 0.6 × 0.70 × 0.50 = 0.21.

Given the transition probability matrix, and per the bigram model, the probability of the POS sequence N → V → N is obtained by multiplying the bigram transition probabilities P(V | N) and P(N | V) (a bigram probability has the form P(tagj | tagi)). The following are two valid calculations:

  • Without START state: P(N → V → N) = P(V | N) × P(N | V) = 0.70 × 0.50 = 0.35
  • With START state: P(N → V → N) = P(N | START) × P(V | N) × P(N | V) = 0.6 × 0.70 × 0.50 = 0.21

In this question, you are given the bigram transition probabilities and the initial (START) probability, so the calculation that includes the START state gives the intended answer, 0.21.
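
The two calculations can be reproduced in a few lines of Python. The variable names are ad hoc; the numbers come directly from the question.

```python
# Values from Question 12.
p_N_given_START = 0.6   # initial probability P(N)
p_V_given_N = 0.70      # P(V | N)
p_N_given_V = 0.50      # P(N | V)

# Without the START state: multiply only the two bigram transitions.
print(p_V_given_N * p_N_given_V)                     # 0.35
# With the START state: include the initial probability as well.
print(p_N_given_START * p_V_given_N * p_N_given_V)   # ≈ 0.21
```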

13. Using HMM emissions below, what is the most probable tag for "runs"?
Word   P(word | Verb)   P(word | Noun)
runs   0.65             0.20
dog    0.10             0.75
Prior: P(N)=0.4, P(V)=0.6

Answer: A
Explanation: 0.65×0.6 = 0.39 ≫ 0.20×0.4 = 0.08 ⇒ Verb.

The question asks for the tag (Verb or Noun) that is more likely to have generated the word "runs".

How do we decide the tag?

In Hidden Markov Models (HMMs), for a single word, we compare:

P(tag) × P(word | tag)

Step-by-step calculation

1. Probability that runs is a Verb

P(V) × P(runs | V) = 0.6 × 0.65 = 0.39

2. Probability that runs is a Noun

P(N) × P(runs | N) = 0.4 × 0.20 = 0.08

Compare the values

Since 0.39 > 0.08, Verb is the more probable tag.
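
The same comparison in Python, using the priors and emission values from the question (the dictionary layout and variable names are illustrative):

```python
# Priors and emission probabilities for "runs" (Question 13).
priors = {"V": 0.6, "N": 0.4}
emission_runs = {"V": 0.65, "N": 0.20}

# Score each tag as P(tag) * P(runs | tag) and keep the larger score.
scores = {tag: priors[tag] * emission_runs[tag] for tag in priors}
print(scores)                        # roughly {'V': 0.39, 'N': 0.08}
print(max(scores, key=scores.get))   # 'V'
```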

14. You observe the sequence "cats chase mice". Which tag is most probable for chase?
Word    Noun   Verb
cats    0.65   0.05
chase   0.10   0.80
mice    0.70   0.05

The table is read as P(cats | Noun) = 0.65, P(cats | Verb) = 0.05, and so on.


Answer: B
Explanation: P(chase|Verb)=0.80 ≫ P(chase|Noun)=0.10.

You have an observation sequence "cats chase mice" and need to determine which part-of-speech tag (Noun or Verb) is most probable for the word "chase".

The table provides emission probabilities: the probability of observing a particular word given a specific POS tag.

For the word "chase", we look at its emission probabilities:

  • P(chase | Noun) = 0.10 (probability of observing "chase" if it's a Noun)
  • P(chase | Verb) = 0.80 (probability of observing "chase" if it's a Verb)

Since 0.80 > 0.10, the word "chase" is much more likely to be a Verb than a Noun.
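
Since no transition probabilities are given, a small sketch can tag each word independently by its highest emission probability. The nested-dictionary layout below is an assumption made for illustration.

```python
# Emission table from Question 14, read as P(word | tag).
emissions = {
    "cats":  {"Noun": 0.65, "Verb": 0.05},
    "chase": {"Noun": 0.10, "Verb": 0.80},
    "mice":  {"Noun": 0.70, "Verb": 0.05},
}

# With no transition probabilities given, pick for each word the tag
# with the highest emission probability.
for word in ["cats", "chase", "mice"]:
    probs = emissions[word]
    print(word, "->", max(probs, key=probs.get))
# cats -> Noun, chase -> Verb, mice -> Noun
```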

15. Given transitions and equal priors, what is P(N→N→V)?
From \ To    N      V
N            0.55   0.45
V            0.30   0.70
P(N)=0.5

Answer: A
Explanation: 0.5 × 0.55 × 0.45 = 0.12375.
16. Which tag is more probable for "light" given context? Previous tag = ADJ

Answer: C
Explanation: Insufficient data.

What is required to answer this type of question?

To find the most probable tag for a given word, even when the previous tag is known, you need:

P(ti | ti−1) and P(wi | ti)

  • Transition probability → models tag sequence likelihood. (Example: P(NOUN | ADJ), P(ADJ | ADJ))
  • Emission probability → models word–tag compatibility. (Example: P(light | NOUN), P(light | ADJ))

Both are required for a valid probabilistic decision in an HMM, but neither is provided in the question.
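
For illustration only, here is what the decision rule would look like if both tables were available. Every number below is a placeholder invented for the sketch, not data from the question.

```python
def best_tag_given_prev(word, prev_tag, transition, emission, tags):
    # score(t) = P(t | prev_tag) * P(word | t)
    scores = {t: transition[(prev_tag, t)] * emission[(t, word)] for t in tags}
    return max(scores, key=scores.get), scores

# Placeholder probabilities (NOT from the question), just to run the sketch.
transition = {("ADJ", "NOUN"): 0.6, ("ADJ", "ADJ"): 0.2}
emission = {("NOUN", "light"): 0.3, ("ADJ", "light"): 0.4}

print(best_tag_given_prev("light", "ADJ", transition, emission, ["NOUN", "ADJ"]))
# ('NOUN', {'NOUN': 0.18, 'ADJ': 0.08}) with these made-up numbers
```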

17. In an HMM-based POS tagger, how is the tag of the first word in a sentence primarily determined?

Answer: C
Explanation: In an HMM, the best tag for a sentence start is mainly determined by the START → tag transition probability, and Determiners most frequently follow the START state in training data.
At the start of a sentence, there is no previous real tag. Instead, the HMM uses a special START state, so the tag of the first word is chosen mainly using:

P(tag | START)

This probability is learned from training data by counting which tags most often begin sentences.
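
A minimal sketch of how P(tag | START) could be estimated from counts, assuming a hypothetical list of sentence-initial tags (the data below is made up):

```python
from collections import Counter

# Hypothetical sentence-initial tags collected from a training corpus.
first_tags = ["DT", "DT", "PRON", "DT", "N", "DT", "PRON"]

counts = Counter(first_tags)
total = sum(counts.values())

# P(tag | START) estimated as the relative frequency of sentence-initial tags.
p_tag_given_start = {tag: c / total for tag, c in counts.items()}
print(p_tag_given_start)
print(max(p_tag_given_start, key=p_tag_given_start.get))  # 'DT' in this toy data
```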
18. Compute likelihood of sequence DT → N → V given the initial P(DT) = 0.55.
From \ To    DT     N      V
DT           0.10   0.75   0.15
N            0.05   0.60   0.35
V            0.10   0.20   0.70



Answer: A
Explanation: 0.55×0.75×0.35 = 0.144.

What does “likelihood of the sequence DT → N → V” mean?

It means: What is the probability that the HMM generates this tag sequence?

In an HMM, the likelihood of a tag sequence is computed by multiplying the relevant transition probabilities.

Formula to be used

P(DT → N → V) = P(DT | START) × P(N | DT) × P(V | N)


Substituting values

= 0.55 × 0.75 × 0.35 ≈ 0.144
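
The same calculation can be written as a small function over the transition matrix, treating the initial probability as P(DT | START). The nested-dictionary layout is an assumption made for illustration.

```python
# Transition matrix from Question 18 as nested dictionaries; the initial
# probability is treated as P(DT | START).
transitions = {
    "START": {"DT": 0.55},
    "DT": {"DT": 0.10, "N": 0.75, "V": 0.15},
    "N":  {"DT": 0.05, "N": 0.60, "V": 0.35},
    "V":  {"DT": 0.10, "N": 0.20, "V": 0.70},
}

def sequence_likelihood(tags):
    prob, prev = 1.0, "START"
    for tag in tags:
        prob *= transitions[prev][tag]
        prev = tag
    return prob

print(sequence_likelihood(["DT", "N", "V"]))  # 0.55 * 0.75 * 0.35 ≈ 0.144
```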

19. Which tag is most probable for the word "book" given emissions P(book | Noun) = 0.45 and P(book | Verb) = 0.40? Transition probabilities for fixed previous tag VERB are as follows: P(N | V) = 0.55 and P(V | V) = 0.45.

Answer: A
Explanation: 0.55 × 0.45 = 0.2475 > 0.45 × 0.40 = 0.18 ⇒ Noun.

What does HMM compute?

For each possible tag t, an HMM compares:

P(t | previous tag) × P(word | t)

Step 1: Probability of Noun

Score(N) = P(N | V) × P(book | N) = 0.55 × 0.45 = 0.2475

Step 2: Probability of Verb

Score(V) = P(V | V) × P(book | V) = 0.45 × 0.40 = 0.18

Since the score for Noun is greater than that for Verb, Noun is the more probable tag.
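
The comparison in Python, using the transition and emission values stated in the question (variable names are ad hoc):

```python
# Values from Question 19; the previous tag is fixed to V (Verb).
transition_from_V = {"N": 0.55, "V": 0.45}   # P(tag | V)
emission_book = {"N": 0.45, "V": 0.40}       # P(book | tag)

# Score each candidate tag as P(tag | V) * P(book | tag).
scores = {t: transition_from_V[t] * emission_book[t] for t in ["N", "V"]}
print(scores)                        # roughly {'N': 0.2475, 'V': 0.18}
print(max(scores, key=scores.get))   # 'N'
```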

20. If emission probability for unknown words is smoothed using Laplace smoothing, what happens?

Answer: B
Explanation: Laplace smoothing prevents zero probabilities.

What is Laplace smoothing?

Laplace smoothing (add-one smoothing) is used to handle unseen events in probabilistic models.

In POS tagging with HMMs, an unknown word is a word that did not appear in the training data.

Without smoothing:

  • P(unknown word | tag) = 0
  • This would make the entire sequence probability zero, even if all other probabilities are high.

What does Laplace smoothing do?

Laplace smoothing adds 1 to every emission count before normalizing. With a vocabulary of size V, the smoothed emission probability is:

P(w | t) = (count(t, w) + 1) / (count(t) + V)

As a result:

  • Even unseen words receive a small, non-zero probability
  • No emission probability becomes zero

Therefore, the correct answer is: Option B
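
A minimal sketch of add-one smoothing applied to emission counts, assuming simple count dictionaries (the counts and vocabulary size below are made up for illustration):

```python
from collections import Counter

def smoothed_emission_prob(word, tag, emission_counts, tag_counts, vocab_size):
    # Add-one smoothing: P(w | t) = (count(t, w) + 1) / (count(t) + V)
    return (emission_counts[(tag, word)] + 1) / (tag_counts[tag] + vocab_size)

# Made-up counts for illustration.
emission_counts = Counter({("N", "dog"): 3, ("N", "cat"): 2, ("V", "runs"): 4})
tag_counts = Counter({"N": 5, "V": 4})
vocab_size = 3  # dog, cat, runs

# A word never seen with tag N still gets a small non-zero probability.
print(smoothed_emission_prob("zyzzyva", "N", emission_counts, tag_counts, vocab_size))
# (0 + 1) / (5 + 3) = 0.125
```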