Important Definitions

Bigram Model

A bigram model is a probabilistic model that assumes each element (such as a POS tag) depends only on the immediately preceding element. In POS tagging, it is based on the first-order Markov assumption.

Mathematically:

P(ti | t1, …, ti−1) ≈ P(ti | ti−1)

Transition Probability

A transition probability is the probability of one POS tag following another POS tag in a sequence.

Mathematically:

P(tj | ti)

It indicates how likely tag tj is to occur after tag ti.

Emission Probability

An emission probability is the probability that a given POS tag generates (emits) a specific word.

Mathematically:

P(w | t)

It represents how likely word w is to be produced by POS tag t.
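
To make these definitions concrete, here is a minimal Python sketch that estimates transition and emission probabilities by counting over a tiny, made-up tagged corpus. The sentences, tag names, and helper functions below are illustrative assumptions, not part of any specific tagger.

```python
from collections import Counter

# A tiny hand-tagged corpus (made-up data, for illustration only).
tagged_sentences = [
    [("the", "DT"), ("dog", "N"), ("runs", "V")],
    [("a", "DT"), ("cat", "N"), ("sleeps", "V")],
]

tag_counts = Counter()          # count(t)
transition_counts = Counter()   # count(t_i, t_j): tag t_j follows tag t_i
emission_counts = Counter()     # count(t, w): tag t emits word w
prev_tag_counts = Counter()     # count of t_i appearing as a previous tag

for sentence in tagged_sentences:
    previous = "START"
    for word, tag in sentence:
        tag_counts[tag] += 1
        transition_counts[(previous, tag)] += 1
        prev_tag_counts[previous] += 1
        emission_counts[(tag, word)] += 1
        previous = tag

def transition_prob(prev_tag, tag):
    # P(t_j | t_i) = count(t_i, t_j) / count(t_i as previous tag)
    return transition_counts[(prev_tag, tag)] / prev_tag_counts[prev_tag]

def emission_prob(word, tag):
    # P(w | t) = count(t, w) / count(t)
    return emission_counts[(tag, word)] / tag_counts[tag]

print(transition_prob("DT", "N"))  # 1.0 in this toy corpus
print(emission_prob("dog", "N"))   # 0.5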

11. Using the POS-tag HMM below, what is the most likely tag for the word "apple"?

Tags: Noun (N), Verb (V)
Emission Probabilities:
Word     P(word | N)   P(word | V)
eat      0.05          0.60
apple    0.70          0.10
Assume equal prior tag probability.

Answer: A
Explanation: P(apple|N)=0.70 ≫ P(apple|V)=0.10. With equal priors, **Noun** wins.
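
As a quick check, the comparison can be scripted. The sketch below simply restates the emission values from the table; with equal priors the prior term cancels, so only the emissions need to be compared.

```python
# Emission probabilities for "apple", taken from the table above.
emission_apple = {"N": 0.70, "V": 0.10}

# With equal priors, the prior term is the same for both tags,
# so the comparison reduces to the emission probabilities alone.
print(max(emission_apple, key=emission_apple.get))  # 'N'
```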
12. In the following transition matrix, what is the probability of the POS sequence N → V → N?
From \ To    N      V
N            0.30   0.70
V            0.50   0.50
Initial probability P(N)=0.6

Answer: C
Explanation: P(N→V)=0.70, P(V→N)=0.50 ⇒ 0.6 × 0.70 × 0.50 = 0.21.

Given the transition probability matrix, and per the bigram model, the probability of the POS sequence N → V → N is obtained by multiplying the bigram transition probabilities P(V | N) and P(N | V) (a bigram probability has the form P(tagj | tagi)). The following are two valid calculations:

  • Without START state: P(N → V → N) = P(V | N) × P(N | V) = 0.70 × 0.50 = 0.35
  • With START state: P(N → V → N) = P(N | START) × P(V | N) × P(N | V) = 0.6 × 0.70 × 0.50 = 0.21

In this question, you are given the bigram transition probabilities and the initial (START) probability, so the calculation that includes the START state gives the intended answer, 0.21.
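
The two calculations can be reproduced in a few lines of Python. The variable names are ad hoc; the numbers come directly from the question.

```python
# Values from Question 12.
p_N_given_START = 0.6   # initial probability P(N)
p_V_given_N = 0.70      # P(V | N)
p_N_given_V = 0.50      # P(N | V)

# Without the START state: multiply only the two bigram transitions.
print(p_V_given_N * p_N_given_V)                     # 0.35
# With the START state: include the initial probability as well.
print(p_N_given_START * p_V_given_N * p_N_given_V)   # ≈ 0.21
```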

13. Using HMM emissions below, what is the most probable tag for "runs"?
Word   P(word | Verb)   P(word | Noun)
runs   0.65             0.20
dog    0.10             0.75
Prior: P(N)=0.4, P(V)=0.6

Answer: A
Explanation: 0.65×0.6 = 0.39 ≫ 0.20×0.4 = 0.08 ⇒ Verb.

The question asks for the tag (Verb or Noun) that is more likely to have generated the word "runs".

How do we decide the tag?

In Hidden Markov Models (HMMs), for a single word, we compare:

P(tag) × P(word | tag)

Step-by-step calculation

1. Probability that runs is a Verb

P(V) × P(runs | V) = 0.6 × 0.65 = 0.39

2. Probability that runs is a Noun

P(N) × P(runs | N) = 0.4 × 0.20 = 0.08

Compare the values

Since 0.39 > 0.08, Verb is the more probable tag.
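
The same comparison in Python, using the priors and emission values from the question (the dictionary layout and variable names are illustrative):

```python
# Priors and emission probabilities for "runs" (Question 13).
priors = {"V": 0.6, "N": 0.4}
emission_runs = {"V": 0.65, "N": 0.20}

# Score each tag as P(tag) * P(runs | tag) and keep the larger score.
scores = {tag: priors[tag] * emission_runs[tag] for tag in priors}
print(scores)                        # roughly {'V': 0.39, 'N': 0.08}
print(max(scores, key=scores.get))   # 'V'
```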

14. You observe the sequence "cats chase mice". Which tag is most probable for chase?
Word    Noun   Verb
cats    0.65   0.05
chase   0.10   0.80
mice    0.70   0.05

The table is read as P(cats | Noun) = 0.65, P(cats | Verb) = 0.05, and so on.


Answer: B
Explanation: P(chase|Verb)=0.80 ≫ P(chase|Noun)=0.10.

You have an observation sequence "cats chase mice" and need to determine which part-of-speech tag (Noun or Verb) is most probable for the word "chase".

The table provides emission probabilities: the probability of observing a particular word given a specific POS tag.

For the word "chase", we look at its emission probabilities:

  • P(chase | Noun) = 0.10 (probability of observing "chase" if it's a Noun)
  • P(chase | Verb) = 0.80 (probability of observing "chase" if it's a Verb)

Since 0.80 > 0.10, the word "chase" is much more likely to be a Verb than a Noun.
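
Since no transition probabilities are given, a small sketch can tag each word independently by its highest emission probability. The nested-dictionary layout below is an assumption made for illustration.

```python
# Emission table from Question 14, read as P(word | tag).
emissions = {
    "cats":  {"Noun": 0.65, "Verb": 0.05},
    "chase": {"Noun": 0.10, "Verb": 0.80},
    "mice":  {"Noun": 0.70, "Verb": 0.05},
}

# With no transition probabilities given, pick for each word the tag
# with the highest emission probability.
for word in ["cats", "chase", "mice"]:
    probs = emissions[word]
    print(word, "->", max(probs, key=probs.get))
# cats -> Noun, chase -> Verb, mice -> Noun
```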

15. Given transitions and equal priors, what is P(N→N→V)?
From \ To    N      V
N            0.55   0.45
V            0.30   0.70
P(N)=0.5

Answer: A
Explanation: 0.5 × 0.55 × 0.45 = 0.12375.
16. Which tag is more probable for "light" given context? Previous tag = ADJ

Answer: C
Explanation: Insufficient data.

What is required to answer this type of question?

To find the most probable tag for a given word, even when the previous tag is known, you need:

P(ti | ti−1) and P(wi | ti)

  • Transition probability → models tag sequence likelihood. (Example: P(NOUN | ADJ), P(ADJ | ADJ))
  • Emission probability → models word–tag compatibility. (Example: P(light | NOUN), P(light | ADJ))

Both are required for a valid probabilistic decision in an HMM, but neither is provided in the question.
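
For illustration only, here is what the decision rule would look like if both tables were available. Every number below is a placeholder invented for the sketch, not data from the question.

```python
def best_tag_given_prev(word, prev_tag, transition, emission, tags):
    # score(t) = P(t | prev_tag) * P(word | t)
    scores = {t: transition[(prev_tag, t)] * emission[(t, word)] for t in tags}
    return max(scores, key=scores.get), scores

# Placeholder probabilities (NOT from the question), just to run the sketch.
transition = {("ADJ", "NOUN"): 0.6, ("ADJ", "ADJ"): 0.2}
emission = {("NOUN", "light"): 0.3, ("ADJ", "light"): 0.4}

print(best_tag_given_prev("light", "ADJ", transition, emission, ["NOUN", "ADJ"]))
# ('NOUN', {'NOUN': 0.18, 'ADJ': 0.08}) with these made-up numbers
```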

17. In an HMM-based POS tagger, how is the tag of the first word in a sentence primarily determined?

Answer: C
Explanation: In an HMM, the best tag for a sentence start is mainly determined by the START → tag transition probability, and Determiners most frequently follow the START state in training data.
At the start of a sentence, there is no previous real tag. Instead, the HMM uses a special START state, so the tag of the first word is chosen mainly using:

P(tag | START)

This probability is learned from training data by counting which tags most often begin sentences.
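
A minimal sketch of how P(tag | START) could be estimated from counts, assuming a hypothetical list of sentence-initial tags (the data below is made up):

```python
from collections import Counter

# Hypothetical sentence-initial tags collected from a training corpus.
first_tags = ["DT", "DT", "PRON", "DT", "N", "DT", "PRON"]

counts = Counter(first_tags)
total = sum(counts.values())

# P(tag | START) estimated as the relative frequency of sentence-initial tags.
p_tag_given_start = {tag: c / total for tag, c in counts.items()}
print(p_tag_given_start)
print(max(p_tag_given_start, key=p_tag_given_start.get))  # 'DT' in this toy data
```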
18. Compute likelihood of sequence DT → N → V given the initial P(DT) = 0.55.
From \ To    DT     N      V
DT           0.10   0.75   0.15
N            0.05   0.60   0.35
V            0.10   0.20   0.70



Answer: A
Explanation: 0.55×0.75×0.35 = 0.144.

What does “likelihood of the sequence DT → N → V” mean?

It means: What is the probability that the HMM generates this tag sequence?

In an HMM, the likelihood of a tag sequence is computed by multiplying the relevant transition probabilities.

Formula to be used

P(DT → N → V) = P(DT | START) × P(N | DT) × P(V | N)


Substituting values

= 0.55 × 0.75 × 0.35 ≈ 0.144
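
The same calculation can be written as a small function over the transition matrix, treating the initial probability as P(DT | START). The nested-dictionary layout is an assumption made for illustration.

```python
# Transition matrix from Question 18 as nested dictionaries; the initial
# probability is treated as P(DT | START).
transitions = {
    "START": {"DT": 0.55},
    "DT": {"DT": 0.10, "N": 0.75, "V": 0.15},
    "N":  {"DT": 0.05, "N": 0.60, "V": 0.35},
    "V":  {"DT": 0.10, "N": 0.20, "V": 0.70},
}

def sequence_likelihood(tags):
    prob, prev = 1.0, "START"
    for tag in tags:
        prob *= transitions[prev][tag]
        prev = tag
    return prob

print(sequence_likelihood(["DT", "N", "V"]))  # 0.55 * 0.75 * 0.35 ≈ 0.144
```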

19. Which tag is most probable for the word "book" given emissions P(book | Noun) = 0.45 and P(book | Verb) = 0.40? Transition probabilities for fixed previous tag VERB are as follows: P(N | V) = 0.55 and P(V | V) = 0.45.

Answer: A
Explanation: 0.55 × 0.45 = 0.2475 > 0.45 × 0.40 = 0.18 ⇒ Noun.

What does HMM compute?

For each possible tag t, an HMM compares:

P(t | previous tag) × P(word | t)

Step 1: Probability of Noun

Score(N) = P(N | V) × P(book | N) = 0.55 × 0.45 = 0.2475

Step 2: Probability of Verb

Score(V) = P(V | V) × P(book | V) = 0.45 × 0.40 = 0.18

Since the score for Noun is greater than that for Verb, Noun is the more probable tag.
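
The comparison in Python, using the transition and emission values stated in the question (variable names are ad hoc):

```python
# Values from Question 19; the previous tag is fixed to V (Verb).
transition_from_V = {"N": 0.55, "V": 0.45}   # P(tag | V)
emission_book = {"N": 0.45, "V": 0.40}       # P(book | tag)

# Score each candidate tag as P(tag | V) * P(book | tag).
scores = {t: transition_from_V[t] * emission_book[t] for t in ["N", "V"]}
print(scores)                        # roughly {'N': 0.2475, 'V': 0.18}
print(max(scores, key=scores.get))   # 'N'
```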

20. If emission probability for unknown words is smoothed using Laplace smoothing, what happens?

Answer: B
Explanation: Laplace smoothing prevents zero probabilities.

What is Laplace smoothing?

Laplace smoothing (add-one smoothing) is used to handle unseen events in probabilistic models.

In POS tagging with HMMs, an unknown word is a word that did not appear in the training data.

Without smoothing:

  • P(unknown word | tag) = 0
  • This would make the entire sequence probability zero, even if all other probabilities are high.

What does Laplace smoothing do?

Laplace smoothing adds 1 to every emission count before normalizing. With a vocabulary of size V, the smoothed emission probability is:

P(w | t) = (count(t, w) + 1) / (count(t) + V)

As a result:

  • Even unseen words receive a small, non-zero probability
  • No emission probability becomes zero

Therefore, the correct answer is: Option B
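
A minimal sketch of add-one smoothing applied to emission counts, assuming simple count dictionaries (the counts and vocabulary size below are made up for illustration):

```python
from collections import Counter

def smoothed_emission_prob(word, tag, emission_counts, tag_counts, vocab_size):
    # Add-one smoothing: P(w | t) = (count(t, w) + 1) / (count(t) + V)
    return (emission_counts[(tag, word)] + 1) / (tag_counts[tag] + vocab_size)

# Made-up counts for illustration.
emission_counts = Counter({("N", "dog"): 3, ("N", "cat"): 2, ("V", "runs"): 4})
tag_counts = Counter({"N": 5, "V": 4})
vocab_size = 3  # dog, cat, runs

# A word never seen with tag N still gets a small non-zero probability.
print(smoothed_emission_prob("zyzzyva", "N", emission_counts, tag_counts, vocab_size))
# (0 + 1) / (5 + 3) = 0.125
```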