Thursday, December 11, 2025

Top 10 Advanced HMM for POS Tagging — Important MCQs (2nd Order, Trigram, Smoothing, Viterbi)

✔ Scroll down and test yourself — answers are hidden under the “View Answer” button.

Advanced POS Tagging with Hidden Markov Models — MCQ Introduction

Master POS tagging with this focused MCQ set on Hidden Markov Models (HMM): second-order & trigram models, Viterbi decoding, Baum-Welch training, smoothing techniques, and practical tips for supervised & unsupervised learning.

Part-of-speech (POS) tagging is a cornerstone task in Natural Language Processing (NLP) that assigns grammatical categories (noun, verb, adjective, etc.) to each token in a sentence. This question set concentrates on classical statistical taggers built with Hidden Markov Models (HMMs), which model tag sequences as hidden states and observed words as emissions.

HMM-based POS taggers remain valuable for their interpretability and efficiency. They are particularly useful when you need:

  • Lightweight, fast taggers for resource-constrained systems
  • Explainable probabilistic models for linguistic analysis and teaching
  • Strong baselines before moving to neural models like BiLSTM or Transformer taggers

This MCQ collection targets advanced HMM concepts — including second-order (trigram) models, the Viterbi decoding algorithm, Forward–Backward / Baum–Welch for unsupervised learning, and various smoothing strategies to handle rare or unseen words. Each question includes a concise explanation to help you understand not only the correct choice but why it matters for real-world POS tagging.

What you’ll learn

  1. How higher-order HMMs (trigram / second-order) capture broader tag context.
  2. Why supervised training requires labeled word–tag corpora and how unsupervised EM works.
  3. The purpose of smoothing (Laplace, Good-Turing) to avoid zero probabilities.
  4. Trade-offs: model complexity, overfitting, and inference cost when increasing hidden states.

Use these MCQs to prepare for exams, interviews, or to evaluate your grounding before progressing to neural POS taggers. Scroll down to start the questions and test your understanding of HMM-based POS tagging fundamentals and advanced techniques.

11. A second-order POS HMM considers:

Answer: B
Explanation:

Higher-order HMMs allow transitions of the form P(tᵢ | tᵢ₋₁, tᵢ₋₂), which capture more tag context.

A second-order POS HMM (Hidden Markov Model) is also called a trigram HMM.

It means: P(tᵢ | tᵢ₋₁, tᵢ₋₂).

This tells us: The probability of the current tag depends on two previous tags. Therefore, the HMM "looks back" two steps in the tag sequence.

Example: Take the simple sentence "dogs chase cats". For the last word "cats", whose tag is t₃, the previous two tags are t₂ = VERB (the tag of "chase") and t₁ = NOUN (the tag of "dogs"), so the trigram transition is written:

P(t₃ = tag of "cats" | t₂ = VERB, t₁ = NOUN)
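As a rough illustration, here is a minimal Python sketch (toy tag sequences; the function name is ours, not a library API) of how trigram transition probabilities could be estimated by maximum likelihood from tagged data:

```python
from collections import defaultdict

def trigram_transition_probs(tag_sequences):
    """Estimate P(t_i | t_{i-2}, t_{i-1}) by maximum likelihood from tag sequences."""
    trigram_counts = defaultdict(int)   # counts of (t_{i-2}, t_{i-1}, t_i)
    bigram_counts = defaultdict(int)    # counts of the conditioning context (t_{i-2}, t_{i-1})
    for tags in tag_sequences:
        padded = ["<s>", "<s>"] + tags  # two start symbols give context for the first tag
        for i in range(2, len(padded)):
            trigram_counts[(padded[i-2], padded[i-1], padded[i])] += 1
            bigram_counts[(padded[i-2], padded[i-1])] += 1
    return {tri: c / bigram_counts[tri[:2]] for tri, c in trigram_counts.items()}

# Toy corpus of tag sequences (hypothetical)
probs = trigram_transition_probs([["NOUN", "VERB", "NOUN"], ["NOUN", "VERB", "ADJ"]])
print(probs[("NOUN", "VERB", "NOUN")])   # 0.5 -> P(NOUN | previous two tags NOUN, VERB)
```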
12. Input for supervised POS HMM training must contain:

Answer: B
Explanation:

Supervised models need labeled corpora to learn emission + transition probabilities.

Supervised POS HMM — Why labeled data is required?

A supervised POS HMM needs already labeled data so it can learn probabilities:

Transition probabilities
P(tₖ | tₖ₋₁) or P(tₖ | tₖ₋₁, tₖ₋₂)

These require tag sequences.

Emission probabilities
P(wₖ | tₖ)

These require each word paired with its correct tag.

Therefore, supervised training must have sentences where every word already has a POS tag.

Example training line:
Dogs/NOUN chase/VERB cats/NOUN
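To make this concrete, here is a minimal sketch (toy two-sentence corpus; all names are illustrative) of how transition and emission probabilities could be estimated by counting over a labeled corpus:

```python
from collections import defaultdict

corpus = [
    [("Dogs", "NOUN"), ("chase", "VERB"), ("cats", "NOUN")],
    [("Cats", "NOUN"), ("sleep", "VERB")],
]

transition, emission, tag_count, prev_count = (defaultdict(int) for _ in range(4))

for sentence in corpus:
    prev = "<s>"                            # sentence-start symbol
    for word, tag in sentence:
        transition[(prev, tag)] += 1        # count tag-to-tag transitions
        prev_count[prev] += 1
        emission[(tag, word.lower())] += 1  # count word emissions per tag
        tag_count[tag] += 1
        prev = tag

# Maximum-likelihood estimates
p_trans = {k: v / prev_count[k[0]] for k, v in transition.items()}
p_emit = {k: v / tag_count[k[0]] for k, v in emission.items()}

print(p_trans[("NOUN", "VERB")])   # 1.0: every NOUN in this toy corpus is followed by a VERB
print(p_emit[("VERB", "chase")])   # 0.5: "chase" accounts for 1 of the 2 VERB tokens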
13. Decoding in POS HMM refers to:

Answer: C
Explanation:

Decoding maps observed words to best hidden tag sequence using Viterbi.

In a POS HMM (Hidden Markov Model), decoding means: Finding the most probable sequence of POS tags for a given sequence of words. This is usually done using the Viterbi algorithm.

So, decoding = tagging = choosing the best tag path.
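Below is a minimal Viterbi decoding sketch for a bigram HMM (the probability tables and function are illustrative, not any specific library's API):

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under a bigram HMM."""
    # best[t] = (probability of the best path ending in tag t, that path)
    best = {t: (start_p[t] * emit_p[t].get(words[0], 0.0), [t]) for t in tags}
    for word in words[1:]:
        new_best = {}
        for t in tags:
            # choose the predecessor tag that maximizes the path probability
            prob, path = max(
                (best[prev][0] * trans_p[prev][t] * emit_p[t].get(word, 0.0), best[prev][1])
                for prev in tags
            )
            new_best[t] = (prob, path + [t])
        best = new_best
    return max(best.values())  # (probability, best tag path)

# Hypothetical toy model
tags = ["Noun", "Verb"]
start_p = {"Noun": 0.6, "Verb": 0.4}
trans_p = {"Noun": {"Noun": 0.4, "Verb": 0.6}, "Verb": {"Noun": 0.5, "Verb": 0.5}}
emit_p = {"Noun": {"dog": 0.6, "runs": 0.1}, "Verb": {"dog": 0.1, "runs": 0.7}}
print(viterbi(["dog", "runs"], tags, start_p, trans_p, emit_p))  # (0.1512, ['Noun', 'Verb'])
```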
14. Forward-Backward algorithm is mainly used for:

Answer: B
Explanation:

It computes expected probabilities used in Baum-Welch EM training.

More information:

The Forward–Backward algorithm is the core of the Baum–Welch algorithm, which is used for: Unsupervised training of Hidden Markov Models (HMMs).

In unsupervised learning, the data has no POS tags, so the model must estimate both the transition probabilities and the emission probabilities on its own.

The Forward–Backward algorithm computes:

  • Forward probabilities α
  • Backward probabilities β
  • Expected counts for transitions and emissions
These expected counts are then used to re-estimate HMM parameters. This is EM (Expectation–Maximization).
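The toy sketch below (a hypothetical two-state model with made-up numbers) shows only the E-step quantities: the forward probabilities α, the backward probabilities β, and the per-position state posteriors γ that Baum-Welch turns into expected counts:

```python
import numpy as np

# Hypothetical 2-state HMM: states 0 and 1, observation symbols indexed 0..1
pi = np.array([0.6, 0.4])                  # initial state probabilities
A = np.array([[0.7, 0.3], [0.4, 0.6]])     # A[i, j] = P(state j | state i)
B = np.array([[0.8, 0.2], [0.3, 0.7]])     # B[i, k] = P(observation k | state i)
obs = [0, 1, 0]                            # observed sequence

T, N = len(obs), len(pi)
alpha = np.zeros((T, N))
beta = np.zeros((T, N))

# Forward pass: alpha[t, j] = P(o_1..o_t, state_t = j)
alpha[0] = pi * B[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

# Backward pass: beta[t, i] = P(o_{t+1}..o_T | state_t = i)
beta[T - 1] = 1.0
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

# State posteriors gamma[t, i] = P(state_t = i | all observations)
gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)
print(gamma)   # expected state occupancies used to re-estimate the parameters
```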
15. Smoothing in HMM prevents:

Answer: D
Explanation:

Unseen tag-word or tag-tag pairs must not receive zero probability → smoothing redistributes probability mass to them.

In an HMM, probabilities are estimated from counts in the training data. If a transition or a word–tag pair never appears in training data, its probability becomes zero. This is dangerous because:

  • A zero probability wipes out entire Viterbi paths
  • The model cannot handle unseen words or unseen tag transitions

Smoothing (like Laplace, Good–Turing, Witten–Bell) adds a small nonzero probability to unseen events.

So smoothing prevents: Zero-probability transitions and zero-probability emissions.
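As a small sketch (hypothetical counts and helper name), Add-1 smoothing of an emission probability could look like this:

```python
def laplace_emission(word, tag, emission_counts, tag_counts, vocab_size):
    """Add-1 smoothed emission probability P(word | tag)."""
    return (emission_counts.get((tag, word), 0) + 1) / (tag_counts[tag] + vocab_size)

# Hypothetical counts: "glorf" was never seen as a Noun
emission_counts = {("Noun", "dog"): 30, ("Noun", "cat"): 20}
tag_counts = {"Noun": 50}
print(laplace_emission("glorf", "Noun", emission_counts, tag_counts, vocab_size=10))  # 1/60 ≈ 0.0167
```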
16. A core limitation of POS HMM is that it:

Answer: B
Explanation:

HMM relies only on previous tag and assumes words depend only on tag, limiting context.

Markov + word independence - a limitation of HMMs

A standard POS HMM makes two strong assumptions.

  • Markov Assumption (for tags): The current tag depends only on a small number of previous tags.
    • This ignores long-range syntactic dependencies (e.g., subject–verb agreement across clauses).
  • Output Independence Assumption (for words): Words depend only on their own tag, not surrounding words.
    • This ignores context that modern taggers use (e.g., CRFs, BiLSTMs, Transformers).

These assumptions simplify the model, but they also severely limit accuracy compared to modern NLP models.
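Concretely, these two assumptions are what make the joint probability of a bigram HMM factorize into purely local terms (a standard result, stated here for reference):

P(w₁ … wₙ, t₁ … tₙ) = ∏ᵢ P(tᵢ | tᵢ₋₁) × P(wᵢ | tᵢ)

Each factor looks only one tag back and only at the current word's own tag, which is exactly where the two limitations above come from.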

17. Increasing hidden states in POS HMM generally:

Answer: A
Explanation:

More states = more parameters → risk of overfitting & slower inference.


Increasing hidden states in POS HMM may cause overfitting

In a Hidden Markov Model (HMM) used for Part-of-Speech (POS) tagging, the "hidden states" correspond to the POS tags (like Noun, Verb, Adjective). Increasing the number of hidden states means using a more granular tagset (e.g., splitting "Noun" into "Singular Noun" and "Plural Noun") or simply increasing the model's capacity in an unsupervised setting.


Effect of increasing hidden states - Discussion

When you increase the number of states N:

  • You must estimate many more parameters.
  • But your dataset size stays the same.

So the model tries to estimate:

  • Many more transition probabilities (N²),
  • Many more emission probabilities (N × V).

With limited data, the HMM begins to:

  • Fit the quirks/noise of the training data,
  • Memorize rare patterns,
  • Over-specialize to word sequences it has seen,
  • Lose its ability to generalize to unseen text.

This phenomenon is overfitting.
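A quick back-of-the-envelope calculation (hypothetical tagset and vocabulary sizes) shows how fast the parameter count grows:

```python
def hmm_parameter_count(n_states, vocab_size):
    """Transition (N*N) plus emission (N*V) parameters in a first-order HMM."""
    return n_states * n_states + n_states * vocab_size

for n in (12, 45, 200):   # e.g. a coarse tagset, a Penn Treebank-sized tagset, fine-grained states
    print(n, hmm_parameter_count(n, vocab_size=50_000))
# 12 ->    600,144 parameters
# 45 ->  2,252,025 parameters
# 200 -> 10,040,000 parameters, all estimated from the same fixed training set
```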

18. Rare words are best handled using:

Answer: A
Explanation:

Smoothing reallocates probability mass → better tagging for unseen/low-freq words.

19. A trigram HMM improves tagging by modeling:

Answer: B
Explanation:

Trigram uses P(tᵢ | tᵢ₋₁, tᵢ₋₂) → better captures context patterns.

20. Unsupervised POS HMM accuracy increases with:

Answer: B
Explanation:

Morphology aids tagging without labels → suffix, prefix, capitalization rules.

Thursday, December 4, 2025

10 Advanced Machine Learning MCQs with Answers & Explanations (Generative vs Discriminative, KDE, Boosting, k-NN)

✔ Scroll down and test yourself — answers are hidden under the “View Answer” button.


Machine Learning - Advanced MCQs

Understanding the foundations of machine learning requires a strong grasp of how different models learn from data, make predictions, and generalize. This collection of MCQs covers essential concepts such as generative vs discriminative classification, k-NN behavior, MAP vs MLE estimation, boosting dynamics, kernel methods, and decision tree depth—topics frequently asked in exams, interviews, and university courses.

These questions are designed to strengthen conceptual clarity and test real-world intuition about model assumptions, probability distributions, density estimation, and decision boundaries.

Whether you are preparing for GATE, UGC NET, university assessments, data science interviews, or machine learning certifications, this curated set will help you quickly revise key principles and identify common pitfalls in ML theory.

1. In a generative classification model, once you estimate the class-conditional density P(X∣Y) and prior P(Y), the decision rule is obtained by:

Answer: B
Explanation:

Generative models estimate P(X∣Y) and P(Y). Classification is performed using Bayes’ rule: P(Y∣X) ∝ P(X∣Y)P(Y).

Generative models are a class of machine learning models that learn the underlying data distribution and can generate new data samples similar to those seen during training.

A generative model learns the joint probability distribution: P(X, Y) or just P(X). This means the model tries to understand how the data is produced, not just how to classify it.
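As a tiny numeric sketch (hypothetical priors and class-conditional likelihoods), the generative decision rule picks the class that maximizes P(X∣Y)P(Y):

```python
priors = {"spam": 0.3, "ham": 0.7}           # P(Y), hypothetical
likelihoods = {"spam": 0.08, "ham": 0.01}    # P(X | Y) for the observed X, hypothetical

unnormalized = {y: likelihoods[y] * priors[y] for y in priors}   # P(X | Y) P(Y)
evidence = sum(unnormalized.values())                            # P(X)
posterior = {y: v / evidence for y, v in unnormalized.items()}   # Bayes' rule: P(Y | X)

print(max(posterior, key=posterior.get), posterior)   # 'spam', since 0.024 > 0.007
```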

2. Logistic regression and Gaussian Naive Bayes can produce identical decision boundaries under which condition?

Answer: A
Explanation:

GNB with shared identity covariance produces a linear discriminant identical to logistic regression’s functional form.

Understanding the Question

This question asks under which specific condition Logistic Regression (LR) and Gaussian Naive Bayes (GNB) classifiers produce identical decision boundaries. The key is understanding the mathematical relationship between these two seemingly different algorithms.

Both models, Logistic Regression (LR) and Gaussian Naïve Bayes (GNB) normally produce different decision boundaries because:

  • LR is discriminative → models P(Y∣X). That is, Logistic Regression directly models the conditional probability P(Y∣X) using the logistic function.
  • GNB is generative → models P(X∣Y). That is, Gaussian Naive Bayes is a generative classifier that models the joint probability P(X,Y) by estimating P(Y) and P(X∣Y).

But under a special condition, they produce identical linear decision boundaries. That special condition is: When the covariances of all classes are identity and equal.

When GNB assumes identity covariance (no correlation between features, each feature with variance 1) shared by all classes, Gaussian Naive Bayes's decision boundary has the same mathematical form as Logistic Regression.

Both models produce a boundary of the form wX + b = 0: the same functional form, and therefore the same kind of separating hyperplane.
3. Which of the following best explains why the training error of 1-NN is always zero?

Answer: B
Explanation:

Each training sample is its own closest neighbor, so 1-NN always predicts correctly on training data.

What is 1-NN?

1-NN means 1-Nearest Neighbor, which is the simplest form of the k-Nearest Neighbors (k-NN) algorithm. 1-NN classifier assigns the class of a new point based on the single closest training point in the dataset.

Why is the training error of 1-NN always zero?

In 1-Nearest Neighbor classification, when predicting the label of a data point, the algorithm finds the closest point in the dataset. But if you test 1-NN on the same training data, then every training point’s nearest neighbor is itself (distance = 0). So the classifier simply returns its own label, which is always correct.

Thus: Training Error = 0, because no point is misclassified when it's compared with itself.
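A minimal demonstration, assuming scikit-learn is available (the guarantee holds as long as no two identical training points carry different labels):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))       # random 2-D points (duplicates essentially impossible)
y = rng.integers(0, 2, size=100)    # random, even meaningless, labels

clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.score(X, y))   # 1.0: each training point is its own nearest neighbor
```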
4. For which type of prior does the MAP estimate not converge to the MLE even with infinite data?

Answer: C
Explanation:

A degenerate (a.k.a. point-mass or delta) prior forces the parameter to a fixed value regardless of data, so MAP ≠ MLE even with infinite samples.

5. Cross-validation is useful in boosting primarily because:

Answer: A
Explanation:

Boosting can overfit if allowed to run indefinitely; CV selects the optimal number of rounds.

Boosting keeps improving training accuracy indefinitely and can easily overfit, so cross-validation is needed to decide how many boosting steps to perform.

What is boosting?

Boosting is a family of ensemble learning techniques that turn a collection of weak learners (models that are only slightly better than random guessing) into a single strong learner with high predictive accuracy. The core idea is simple: train models sequentially, each one focusing on the mistakes made by the previous ones, and then combine their predictions (usually by a weighted vote or sum). By doing this, the ensemble corrects its own errors over time and ends up far more powerful than any individual component.

What is cross-validation?

Cross-validation is a fundamental resampling technique used to evaluate machine learning models' ability to generalize to unseen data while preventing overfitting. It works by systematically partitioning the dataset into multiple subsets (called folds), training models on some subsets, and testing on others, with this process repeated multiple times to obtain a reliable performance estimate.

Why does boosting need cross-validation?

Boosting algorithms (like AdaBoost, Gradient Boosting, XGBoost, etc.) build models sequentially, adding weak learners (usually decision stumps/trees) one at a time.

Unlike many other models:
  • There is no built-in rule that tells you when to stop adding more learners.
  • If you keep boosting longer, the model can overfit heavily.
So, to choose the right number of boosting rounds, we use cross-validation. Cross-validation helps to decide: How many weak learners give the best performance without overfitting?

This is why libraries like XGBoost include a parameter like early_stopping_rounds, which depends on a validation set.
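A hedged sketch of this idea using scikit-learn's GradientBoostingClassifier and staged_predict on a held-out validation split (a stand-in for library-specific early stopping such as XGBoost's early_stopping_rounds):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

gbm = GradientBoostingClassifier(n_estimators=500, learning_rate=0.1, random_state=0)
gbm.fit(X_tr, y_tr)

# Validation error after each boosting round; pick the round with the lowest error
val_errors = [np.mean(pred != y_val) for pred in gbm.staged_predict(X_val)]
best_rounds = int(np.argmin(val_errors)) + 1
print(best_rounds, min(val_errors))
```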

6. Kernel Density Estimation (KDE) differs from kernel regression because:

Answer: A
Explanation:

KDE estimates P(X), while kernel regression estimates the functional relationship ŷ(x) via weighted averages.

Differences between KDE and Kernel regression


What each method estimates/answers:
  • KDE answers "what is the probability density?" (it answers, 'how are the data distributed?')
  • Kernel regression answers "what is the function value or conditional expectation?" (it answers, 'Given X, what is Y?')
How the kernels are used:
  • KDE uses kernels to smooth the estimated probability distribution,
  • Kernel regression uses kernels to perform weighted local averaging to estimate a conditional relationship between variables.
When to use?
  • Use Kernel Density Estimation when you want to understand how the data is distributed, especially when you do NOT assume the distribution is normal. Example: Estimate the density of customer ages
  • Use kernel regression when you want to predict Y from X in a non-parametric, smooth way.
Supervised vs Unsupervised
  • KDE is unsupervised
  • Kernel regression is supervised
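A small sketch of the contrast, assuming NumPy and SciPy are available (the kernel_regression helper is our own illustration, not a library function):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x = rng.normal(loc=35, scale=10, size=200)      # e.g. customer ages (unsupervised data)
y = 2.0 * x + rng.normal(scale=5, size=200)     # a target to predict from x (supervised data)

# KDE: estimates the density P(X), i.e. "how are the ages distributed?"
density = gaussian_kde(x)
print(density(np.array([30.0, 60.0])))          # density is higher near 30 than near 60

# Kernel (Nadaraya-Watson) regression: estimates E[Y | X] via kernel-weighted averaging
def kernel_regression(x_query, x_train, y_train, bandwidth=2.0):
    weights = np.exp(-0.5 * ((x_query - x_train) / bandwidth) ** 2)
    return np.sum(weights * y_train) / np.sum(weights)

print(kernel_regression(30.0, x, y))            # roughly 60, the conditional mean of Y at X = 30
```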
7. Boosting a set of weak learners generally produces a decision boundary that is:

Answer: B
Explanation:

Boosting aggregates many weak rules, often resulting in highly nonlinear decision boundaries.

How does boosting affect the complexity of the final decision boundary?

Boosting (e.g., AdaBoost, Gradient Boosting) works by combining many weak learners, typically simple classifiers like decision stumps (depth-1 trees). Each weak learner itself has a simple decision boundary.

But boosting does not just average them; it takes a weighted combination based on each learner’s accuracy. Adding many simple boundaries creates a final decision boundary that can be very complex, often highly nonlinear.

This happens because each new weak learner focuses on misclassified points from previous learners, gradually bending the overall decision surface.

8. Which scenario explains how a decision tree can exceed training samples in depth?

Answer: B
Explanation:

If identical feature vectors map to conflicting labels, the tree keeps splitting and can exceed depth n.

Why can a decision tree have depth greater than the number of training samples?

Because depth counts the number of splits along a path, not the number of unique samples or unique feature values. Even if features repeat, the tree keeps splitting as long as it can reduce impurity—possibly creating long chains of binary splits, each separating a subset of samples, even if they have identical feature values.

Why does this happen with repeated features?

When features repeat across multiple samples:
  • The tree must use the same features repeatedly to separate conflicting labels.
  • Each split on a feature that has been previously split becomes less efficient at separating classes.
  • The tree exhibits overfitting behavior, attempting to memorize individual samples rather than learn generalizable patterns.
  • If samples are identical in their selected features but have different labels, the tree becomes unable to achieve purity through feature thresholds alone

Decision trees try to make leaves pure. If purity is impossible, depth grows uncontrollably. This is why real systems use max_depth, min_samples_split, and min_samples_leaf to avoid pathologically overfitted trees.

Example:

When the feature values are repeated (e.g. many rows have x = 5) but the labels differ, the tree may keep trying thresholds that slice right at the repeated value. If the algorithm does not enforce a “strictly decreasing impurity” condition, it could accept a split that leaves the dataset unchanged on one side.

9. In k-NN classification, which statement best explains why increasing k (while keeping the dataset fixed) can improve test performance, especially in noisy datasets?

Answer: A
Explanation:

When k increases, the prediction is based on a majority vote over a larger set of neighbors, which reduces the influence of mislabeled or noisy points. This typically improves generalization by lowering variance, although extremely large k can lead to underfitting.

Larger k reduces sensitivity to noise by averaging over more neighbors.
  • Averaging = majority vote – By looking at several nearby points instead of just one, the classifier “averages” their labels. If a few of those neighbours are mislabeled (or are outliers), they are unlikely to dominate the vote.
  • Noise reduction – Random fluctuations in the training labels act like noise. Majority voting behaves like a low‑pass filter: it suppresses high‑frequency (noisy) variations while preserving the underlying signal.
  • Result on test error – Lower variance ⇒ the learned decision surface is more stable on unseen data, so test error typically goes down (up to a point; if k becomes too large, bias dominates and performance can deteriorate).

Thus, averaging over more neighbours mitigates the effect of noisy or atypical training points, which is why test performance usually improves.
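A small sketch, assuming scikit-learn is available, that compares a few values of k by cross-validation on a deliberately noisy dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Noisy dataset: flip_y injects label noise
X, y = make_classification(n_samples=1000, n_features=10, flip_y=0.2, random_state=0)

for k in (1, 5, 25, 101):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"k={k:>3}  cv accuracy={acc:.3f}")   # accuracy typically rises from k=1, then may drop for very large k
```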

10. Which statement about generative vs discriminative models is correct?

Answer: B
Explanation:

Discriminative models learn P(Y∣X) or direct decision boundaries. Generative models learn P(X,Y).

Tuesday, December 2, 2025

Master HMM with MCQs – Hidden Markov Model Tagging Explained

✔ Scroll down and test yourself — answers are hidden under the “View Answer” button.

Hidden Markov Model - MCQs - Problem-based Practice Questions

HMM-Based POS Tagging Practice

These questions explore key aspects of Hidden Markov Model (HMM) based Part-of-Speech (POS) tagging. Some questions explicitly provide prior (initial) probabilities, while others focus only on transition and emission probabilities. You will practice:

  • Calculating posterior probabilities for individual words.
  • Evaluating sequence likelihoods using transitions and emissions.
  • Handling unseen words with smoothing techniques.
  • Determining most likely tag sequences based on high-probability transitions.
1. Consider the following HMM for POS tagging:

Emission Probabilities:
  • P(dog | Noun) = 0.6
  • P(dog | Verb) = 0.1
  • P(runs | Noun) = 0.1
  • P(runs | Verb) = 0.7

Transition Probabilities:
  • P(next = Noun | current = Noun) = 0.4
  • P(next = Verb | current = Noun) = 0.6
  • P(next = Noun | current = Verb) = 0.5
  • P(next = Verb | current = Verb) = 0.5

Here, 'next' and 'current' in a probability such as P(next = Noun | current = Noun) refer to the POS tag of the next word and the POS tag of the current word respectively.
Which is the most likely tag sequence for the sentence “dog runs” using the HMM?

A. Noun → Noun
B. Noun → Verb
C. Verb → Noun
D. Verb → Verb

Answer: B
Explanation:

P(Noun→Verb) = 0.6 × 0.6 × 0.7 = 0.252. Highest likelihood = Noun→Verb.

Step-by-Step Probability Computation

We compute the probability for each possible tag sequence using:

P(t₂ | t₁) × P(dog | t₁) × P(runs | t₂)


1. Sequence: Noun → Noun
  • P(dog | Noun) = 0.6
  • P(Noun | Noun) = 0.4
  • P(runs | Noun) = 0.1

0.6 × 0.4 × 0.1 = 0.024


2. Sequence: Noun → Verb
  • P(dog | Noun) = 0.6
  • P(Verb | Noun) = 0.6
  • P(runs | Verb) = 0.7

0.6 × 0.6 × 0.7 = 0.252


3. Sequence: Verb → Noun
  • P(dog | Verb) = 0.1
  • P(Noun | Verb) = 0.5
  • P(runs | Noun) = 0.1

0.1 × 0.5 × 0.1 = 0.005


4. Sequence: Verb → Verb
  • P(dog | Verb) = 0.1
  • P(Verb | Verb) = 0.5
  • P(runs | Verb) = 0.7

0.1 × 0.5 × 0.7 = 0.035

Highest Probability = 0.252

Most likely tag sequence:

B. Noun → Verb
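A tiny script (illustrative, using the same scoring as the explanation above, i.e., without an explicit initial-tag probability) that enumerates all four tag sequences and reproduces these numbers:

```python
from itertools import product

emit = {"Noun": {"dog": 0.6, "runs": 0.1}, "Verb": {"dog": 0.1, "runs": 0.7}}
trans = {"Noun": {"Noun": 0.4, "Verb": 0.6}, "Verb": {"Noun": 0.5, "Verb": 0.5}}
words = ["dog", "runs"]

scores = {
    (t1, t2): emit[t1][words[0]] * trans[t1][t2] * emit[t2][words[1]]
    for t1, t2 in product(["Noun", "Verb"], repeat=2)
}
print(scores)                        # ≈ {('Noun','Noun'): 0.024, ('Noun','Verb'): 0.252, ('Verb','Noun'): 0.005, ('Verb','Verb'): 0.035}
print(max(scores, key=scores.get))   # ('Noun', 'Verb')
```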

2. For the HMM below:

Initial tag probabilities:
• P(Noun) = 0.7
• P(Adj) = 0.3

Emission probabilities for the word "red":
• P("red" | Noun) = 0.2
• P("red" | Adj) = 0.8

Calculate the normalized probability that 'red' is tagged as an Adjective.

A. 0.14
B. 0.56
C. 0.63
D. 0.24

Answer: C
Explanation:

Compute unnormalized scores:
• Adj = 0.3 × 0.8 = 0.24
• Noun = 0.7 × 0.2 = 0.14

Normalize to get posterior:
Adj = 0.24 / (0.24 + 0.14) ≈ 0.63.

Calculate the probability of a specific tag assignment for a single observed word

In a Hidden Markov Model (HMM), the probability of a specific tag assignment for a single observed word is calculated using the Joint Probability of the tag and the word. This is often referred to as the "Viterbi score" or "path probability" for that specific tag.

The formula for the joint probability of a single state (tag) and observation (word) is:

P(Tag,Word) = P(Tag) × P(Word∣Tag)

Where:

  • P(Tag) is the Initial tag probability (Prior).
  • P(Word∣Tag) is the Emission probability (Likelihood).

Step 1: Calculate the Joint Probabilities for All Tags

For each possible tag, compute the product of the initial probability and emission probability:

For tag = Adjective (Adj):


P(Adj,"red") = P(Adj) × P("red"∣Adj) = 0.3 × 0.8 = 0.24

For tag = Noun:


P(Noun,"red") = P(Noun) × P("red"∣Noun) = 0.7 × 0.2 = 0.14


Step 2: Calculate the Normalizing Constant (Total Probability)

The normalizing constant is the sum of all joint probabilities:

P("red") = P(Adj,"red") + P(Noun,"red") = 0.24 + 0.14 = 0.38


Step 3: Apply Bayes' Theorem to Get the Posterior Probability

Using the normalization formula:

P(Adj∣"red") = P(Adj,"red") / P("red") = 0.24 / 0.38

Simplifying:


P(Adj∣"red") = 0.24 / 0.38 ≈ 0.6316 or approximately 63.16%

Final Answer

The normalized probability that 'red' is tagged as an Adjective is:

P(Adj∣"red") = 0.24 / 0.38 ≈ 0.632 or 63.2%

3. Using the HMM:

Transition probabilities (row = current tag, column = next tag):
  • From Det: P(Det | Det) = 0.1, P(Noun | Det) = 0.9
  • From Noun: P(Det | Noun) = 0.4, P(Noun | Noun) = 0.6

Emission probabilities P(word | tag):
  • P("the" | Det) = 0.8, P("the" | Noun) = 0.05
  • P("cat" | Det) = 0.01, P("cat" | Noun) = 0.9

Most likely tagging for "the cat" is:

A. Det → Det
B. Det → Noun
C. Noun → Det
D. Noun → Noun

Answer: B
Explanation:

Solution: Let's solve this step by step using the Hidden Markov Model (HMM)

The goal is to find the most likely sequence of tags for the sentence "the cat" using the Viterbi principle.

Step 1: Understand the tables

Transition probabilities (P(tag₂ | tag₁)):

  • From Det: P(Det | Det) = 0.1, P(Noun | Det) = 0.9
  • From Noun: P(Det | Noun) = 0.4, P(Noun | Noun) = 0.6

For example, if the previous tag is Det, the probability that the next tag is Noun is 0.9.

Emission probabilities (P(word | tag)):

  • "the": P("the" | Det) = 0.8, P("the" | Noun) = 0.05
  • "cat": P("cat" | Det) = 0.01, P("cat" | Noun) = 0.9

For example, the probability that the word "cat" is emitted by a Noun is 0.9.

Step 2: Compute joint probabilities for all sequences

We consider all possible tag sequences for "the cat":

Det → Det

P(Det → Det) = transition × emission
Step 1 (first word "the" as Det): P("the"|Det) = 0.8
Step 2 (second word "cat" as Det): transition P(Det|Det) = 0.1, emission P("cat"|Det) = 0.01

Total probability = 0.8 × 0.1 × 0.01 = 0.0008

Det → Noun

Step 1 "the" as Det: P("the"|Det) = 0.8
Step 2 "cat" as Noun: transition P(Noun|Det) = 0.9, emission P("cat"|Noun) = 0.9

Total probability = 0.8 × 0.9 × 0.9 = 0.648

Noun → Det

Step 1 "the" as Noun: P("the"|Noun) = 0.05
Step 2 "cat" as Det: transition P(Det|Noun) = 0.4, emission P("cat"|Det) = 0.01

Total probability = 0.05 × 0.4 × 0.01 = 0.0002

Noun → Noun

Step 1 "the" as Noun: P("the"|Noun) = 0.05
Step 2 "cat" as Noun: transition P(Noun|Noun) = 0.6, emission P("cat"|Noun) = 0.9

Total probability = 0.05 × 0.6 × 0.9 = 0.027

Step 3: Compare probabilities

  • Det → Det: 0.0008
  • Det → Noun: 0.648
  • Noun → Det: 0.0002
  • Noun → Noun: 0.027

Step 4: Most likely tagging

The most likely sequence is: Det → Noun (Option B)

"the" is a determiner (Det), and "cat" is a noun (Noun)

4. Given the emission matrix:

  • P("quickly" | Verb) = 0.2
  • P("quickly" | Adv) = 0.7

If P(Verb)=0.5 and P(Adv)=0.5 initially, probability the word "quickly" is tagged Adv:

A. 0.41
B. 0.55
C. 0.78
D. 0.64

Answer: C
Explanation:

P(Tag,Word) = P(Tag) × P(Word∣Tag)

If "quickly" is tagged as Verb: 0.5×0.2=0.10;

If "quickly" is tagged as Adv: 0.5×0.7=0.35.

Highest is Adv. Hence, normalized Adv = 0.35/(0.10+0.35) = 0.35/0.45 ≈ 0.78.

5. A word appears 10 times as Noun and 2 times as Verb in training. Without smoothing P(word|Noun)= ?

A. 0.2
B. 0.5
C. 0.83
D. 0.91

Answer: C
Explanation:

P(word|noun) means "Out of all times the word occurs, how many times did it occur with the tag Noun?". We are not doing any smoothing—just using raw counts.

The word appears 10 times as Noun

The same word appears 2 times as Verb

Total appearances of the word = 10 + 2 = 12


Since we want P(word | Noun):

P(word | Noun) = Count(word with Noun) / Total count of the word

Substitute the values

P(word | Noun) = 10 / 12 = 0.8333

Rounded: 0.83

6. In an HMM POS-tagger, we want to estimate the emission probability of an unseen word. Consider the word "glorf", which never occurred in the training data.
For the tag Noun, the training corpus contains:
  • Total noun-tagged word tokens = 50
  • Count of "glorf" = 0
  • Vocabulary size (unique words) = 10
Using Add-1 (Laplace) smoothing, compute 𝑃("glorf" ∣ Noun).

A. 1/60
B. 1/51
C. 1/61
D. 51/61

Answer: A
Explanation:

Laplace smoothing → (0+1)/(50 + 10) = 1/60.

Understanding the question

We have an unseen word: "glorf"
That means in the training data count("glorf" | Noun) = 0.
We want to compute P("glorf" | Noun) using Add-1 (Laplace) smoothing.

Given:
  • Total noun-tagged tokens = 50
  • Count of "glorf" under Noun = 0
  • Vocabulary size (V) = 10

Add-1 smoothing formula

P(w | tag) = (count(w, tag) + 1) / (total tokens under tag + V)


Step-by-step calculation

P("glorf" | Noun) = (0 + 1) / (50 + 10) = 1 / 60

7. Which sentence has lower HMM likelihood given high Verb→Noun transition?

A. eat food
B. food eat

Answer: B
Explanation:

"food eat" requires Noun→Verb, which may be low and less natural under English HMM statistics. Because its tag sequence (Noun → Verb) does NOT match the high-probability Verb → Noun transition that the HMM expects.

"eat food" (Verb -> Noun) has HIGH HMM likelihood


"food eat" (Noun -> Verb) has LOW HMM likelihood

8. Given partial Viterbi table:

  • t = 1: word "fish", best tag Noun, prob 0.52
  • t = 2: word "swim", best tag Verb, prob 0.46

Assume the HMM has a strong Verb → Noun transition (i.e., P(Noun|Verb) is high).

Model predicts next tag likely:

A. Noun
B. Verb
C. Both equal
D. Cannot determine

Answer: A
Explanation:

Since the best tag at t=2 is Verb, the predicted next tag depends mainly on the transition probabilities from Verb. The question explicitly states that Verb → Noun transition is strong. Therefore, the HMM expects the next tag to be Noun with highest probability.

Why the Viterbi algorithm predicts Noun as the next tag

The Viterbi algorithm will predict Noun as the most likely next tag because:

  • High transition probability boost: P(Noun|Verb) is high, which significantly increases the probability of the Noun path.
  • Natural language patterns: Verbs commonly take noun objects in English (for example, "swim laps", "fish upstream"), so Verb → Noun sequences are frequent.
  • Viterbi maximization: The algorithm selects the tag sequence that produces the maximum accumulated probability. With a strong Verb→Noun transition, the Noun path will typically have a higher accumulated probability than alternatives.

The strong transition probability from Verb to Noun makes this the most likely prediction for the next tag in the sequence.

9. In an HMM for POS tagging, you are given the following transition probabilities for adjectives:

  • An adjective is followed by a noun with probability 0.75
  • An adjective is followed by another adjective with probability 0.10
These probabilities tell us which tags usually come after an adjective in the training data.

Using only these transition probabilities, which 2-word phrase does the HMM consider more likely?

A. beautiful red
B. beautiful flower

10. In an HMM POS tagger, you observe the single word "cat". The model gives you the following probabilities:

Transition probabilities:
  • P(NN | DT) = 0.8
  • P(VB | DT) = 0.2

Emission probabilities for "cat":
  • P("cat" | NN) = 0.7
  • P("cat" | VB) = 0.1
For this one-word sentence, the tag is chosen mainly based on the emission probability of the word. Based on these values, which tag is the HMM most likely to assign to the word "cat"?

A. DT
B. NN
C. VB
D. Cannot determine

Monday, December 1, 2025

HMM-Based POS Tagging MCQs | Viterbi, Emission & Transition Explained

✔ Scroll down and test yourself — answers are hidden under the “View Answer” button.

1. In an HMM-based POS tagger, hidden states typically represent:

A. Words in the text
B. POS tags
C. Syntactic chunks
D. Sentence categories

Answer: B
Explanation:

In POS tagging using Hidden Markov Models, hidden states correspond to POS tags like NN, VB, DT, etc., while observations are the actual words.

________________________________________
2. The emission probability in an HMM for POS tagging represents:

A. P(tag | word)
B. P(word | tag)
C. P(tag | sentence length)
D. P(sentence | tags)

Answer: B
Explanation:

Emission probability in HMM defines the likelihood of generating (emitting) a word from a particular POS tag: P(word | tag).

What is emission probability?

In the context of Hidden Markov Models (HMMs), the emission probability refers to the likelihood of observing a particular output (observation) from a given hidden state.

HMM Components

An HMM consists of:

  • Hidden states (S): not directly observable.
  • Observations (O): visible outputs generated by the hidden states.
  • Transition probabilities (A): probability of moving from one hidden state to another.
  • Emission probabilities (B): probability of a hidden state generating a particular observation.


Emission Probability Formula

If s is a hidden state and o is an observation:

Emission probability = P(observation | state) = P(o | s)

In POS tagging, this is the probability that a tag emits a particular word.

Example: P("dog" | NN) = 0.005

This means that the word "dog" is generated by the NN (noun) tag with probability 0.005.

Intuition

  • Hidden state: “POS tag of the current word”
  • Observation: “Actual word in the sentence”

Emission probability answers:

“Given that the current word has tag NN, how likely is it to be this particular word?”

________________________________________
3. Transition probabilities in POS HMM tagging capture:

A. Probability of a word given tag
B. Probability of current tag given previous tag
C. Probability of unknown word generation
D. Probability of sentence boundary

Answer: B
Explanation:

Transition probability expresses tag-to-tag dependency, e.g., P(NN | DT), which is relatively high because determiners commonly precede nouns.

What is transition probability in HMM context?

In Hidden Markov Models (HMMs), the transition probability represents the likelihood of moving from one hidden state to another in a sequence.

Formal Definition

If sₜ is the current hidden state and sₜ₋₁ is the previous hidden state, then:

Transition Probability = P(sₜ | sₜ₋₁)
  

It answers the question:

Given the previous hidden state, what is the probability of transitioning to the next state?

Where It Applies

In tasks like Part-of-Speech (POS) tagging:

  • Hidden states = POS tags (NN, VB, DT, JJ...)
  • Transition probability models how likely one tag follows another

Example

P(NN | DT) = 0.65
  

Meaning: If the previous tag is DT (determiner), there is a 65% chance the next tag is NN (noun) (common phrase pattern: the cat, a dog, this book).

________________________________________
4. The algorithm used to find the most probable tag sequence in POS HMM is:

A. Forward algorithm
B. CYK algorithm
C. Viterbi decoding
D. Naive Bayes

Answer: C
Explanation:

Viterbi is a dynamic programming algorithm used for optimal decoding — finding the best tag sequence for a sentence.

Viterbi algorithm

The Viterbi algorithm is a dynamic‑programming method that finds the most probable sequence of hidden states (a path) that could have produced a given observation sequence in a Hidden Markov Model (HMM).

________________________________________
5. In HMM POS tagging, unknown words are usually handled using:

A. Ignoring them during tagging
B. Assigning probability zero
C. Smoothing or suffix-based rules
D. Removing them from corpus

Answer: C
Explanation:

Unknown/rare words are tackled using morphological heuristics, smoothing (Laplace, Good-Turing) or suffix-based tagging methods.

What is smoothing and why is it needed?

In HMM POS tagging, we rely on:

  • Transition probabilities → P(tagₜ | tagₜ₋₁)
  • Emission probabilities → P(word | tag)

If a word never appeared in training data, its emission probability becomes:

P(word | tag) = 0

This is a problem because one zero probability makes the entire sentence probability zero, causing the Viterbi decoding to fail.


What Smoothing Does

Smoothing reassigns small probability to unseen words/events instead of zero. It ensures the model can still tag new sentences even with unknown words.

________________________________________
6. If an HMM uses T tags and vocabulary size V, emission matrix dimension is:

A. V × V
B. T × V
C. T × T
D. 1 × V

Answer: B
Explanation:

Every tag generates any word — hence matrix = #Tags × #Words.

________________________________________
7. A bigram POS HMM assumes:

A. Tag depends on all previous tags
B. Tag depends only on previous tag
C. Word and tag are independent
D. Tags follow uniform probability

Answer: B
Explanation:

Markov assumption → P(tᵢ | tᵢ₋₁), not dependent on entire tag history.

________________________________________
8. The Baum-Welch algorithm trains POS HMM using:

A. Gradient descent
B. Evolutionary optimization
C. Expectation–Maximization (EM)
D. Manual rules

Answer: C
Explanation:

Baum-Welch is an unsupervised EM algorithm re-estimating transition + emission probabilities.

________________________________________
9. Viterbi differs from Forward algorithm because it:

A. Sums probabilities of all paths
B. Chooses the maximum probability path
C. Works only for continuous observations
D. Does not use dynamic programming

Answer: B
Explanation:

Forward algorithm sums over paths. Viterbi picks best single path (max probability).

________________________________________
10. HMM POS tagging suffers most when:

A. Vocabulary is large
B. Words are highly ambiguous
C. Text is short
D. Emission is continuous

Answer: B
Explanation:

Ambiguous words like bank, can, and light require context that an HMM cannot model deeply.

Why does an HMM struggle with ambiguous words?

HMMs are probabilistic sequence models that rely only on transition and emission probabilities. When a word is highly ambiguous, several POS tags have similar probabilities for it, so the model becomes uncertain and may choose the wrong POS tag.
