Which feature improves unknown word tagging accuracy the most in HMM POS tagging?

Learning prefix and suffix distributions per POS tag improves accuracy by leveraging morphological cues.

Training an HMM with labeled POS data is classified as what type of learning?

It is supervised learning since both words and POS tags are known during training.

Which algorithm is used to train an HMM when POS tags are not available?

The Forward-Backward (Baum-Welch) algorithm is used for unsupervised HMM training.

When is error propagation more likely in Viterbi decoding?

When a rare word has a sharply peaked but incorrect emission probability, causing early path commitment.

Why do modern POS taggers outperform HMM-based models?

Neural models capture long-range context and subword features beyond the Markov assumption.

What does a high P(NN | DT) transition probability indicate?

It indicates that nouns are likely to follow determiners, which is common in English syntax.

How is sentence probability computed in an HMM?

It is computed by summing, over all possible tag sequences, the product of transition and emission probabilities along each sequence.

What information is stored in the Viterbi backpointer table?

It stores the most likely previous tag for reconstructing the optimal tag sequence.

Why are Gaussian emission HMMs preferred in speech tagging?

Because speech features are continuous-valued, making Gaussian distributions suitable.

Emission probability in POS tagging refers to:

Probability of word given a tag

Probability of sentence structure

Probability of next two tags jointly

Which problem does Baum–Welch training solve in HMM POS tagging?

Learning parameters from unlabeled text

A common solution for unknown words in HMM POS tagging is:

Laplace or Good–Turing smoothing

What does a second-order POS HMM consider?

A second-order POS HMM considers two previous tags, using P(ti | ti-1, ti-2) to improve contextual modeling.

What is required for supervised POS HMM training?

Supervised POS HMMs need word-tag annotated sentences to estimate emission and transition probabilities.

What does decoding mean in HMM-based POS tagging?

Decoding refers to selecting the most likely tag sequence for a word sequence, typically using the Viterbi algorithm.

What is the Forward-Backward algorithm used for in HMMs?

It is used for unsupervised HMM learning as part of the Baum-Welch EM algorithm to estimate expected probabilities.

Why is smoothing important in HMMs for POS tagging?

Smoothing prevents zero-probability transitions and emissions, allowing the model to handle unseen word-tag pairs.

What is a core limitation of POS HMMs?

HMMs assume Markov independence and that words depend only on their tags, which limits contextual accuracy.

What happens when hidden states in a POS HMM are increased?

Increasing hidden states increases parameters, raising the risk of overfitting and slowing inference.

How should rare words be handled in POS HMMs?

Rare words are best handled using smoothing techniques like Laplace or Good-Turing smoothing.

How does a trigram HMM improve POS tagging?

A trigram HMM models P(ti | ti-1, ti-2), capturing richer tag context for improved tagging accuracy.

What improves accuracy in unsupervised POS HMMs?

Morphological features and smoothing improve accuracy by providing additional cues without labeled data.

Transition probabilities in POS HMM tagging capture?

Probability of current tag given previous tag

In HMM POS tagging, unknown words are usually handled using?

Smoothing or suffix-based rules

A bigram POS HMM assumes?

Tag depends only on previous tag

The Baum-Welch algorithm trains POS HMM using?

Expectation–Maximization (EM)

Viterbi differs from Forward algorithm because it?

Chooses the maximum probability path

Computer Science and Engineering - Tutorials, Notes, MCQs, Questions and Answers

Q: If transitions are uniform/random, HMM POS tagger becomes?

It becomes equivalent to a unigram emission-based selector, relying only on P(word | tag).


Monday, January 5, 2026

HMM POS Tagging MCQs (Advanced) | Viterbi, Baum-Welch & NLP Concepts

20. If transitions are uniform/random, HMM POS tagger becomes:

A. Fully deterministic
B. Distribution-free classifier
C. Equivalent to unigram emission selector
D. Memory-based classifier

Correct Answer: C

With uniform transitions, tagging depends only on P(word|tag), i.e., emission probabilities.

With uniform transitions, an HMM POS tagger reduces to a unigram model that tags each word independently using emission probabilities only.

Step-by-step Explanation

An HMM POS tagger assigns part-of-speech tags using two probabilities:

Transition probability - P(t_i | t_i−1)
→ How likely a tag follows the previous tag

Emission probability - P(w_i | t_i)
→ How likely a word is generated by a tag

During decoding using the Viterbi algorithm, the model maximizes:

∏_i P(t_i | t_i−1) × P(w_i | t_i), where the product runs over all positions i in the sentence

What does uniform / random transitions mean?

Uniform transitions imply:

P(t_i | t_i−1) = constant for all tag pairs

Transition probabilities do not prefer any particular tag sequence
They contribute the same value for every possible path

Therefore, transition probabilities no longer influence the tagging decision.

What remains?

Only the emission probabilities matter:

arg max_{t_i} P(w_i | t_i)

This is exactly what a unigram POS tagger does:

Assigns each word the tag with the highest emission probability
Ignores contextual information entirely
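The equivalence can be checked with a minimal sketch. The tag set and emission values below are hypothetical, chosen only for illustration, and the start transition is assumed to be uniform as well. The first function scores every tag sequence with uniform transitions; the second tags each word independently by emission.

```python
from itertools import product

# Hypothetical toy parameters, chosen only for illustration.
TAGS = ["NN", "VB"]
EMIT = {
    "NN": {"time": 0.6, "flies": 0.3},
    "VB": {"time": 0.1, "flies": 0.5},
}
UNIFORM = 1.0 / len(TAGS)  # P(t_i | t_i-1) is the same for every tag pair


def best_sequence_uniform(words):
    """Score every tag sequence with uniform transitions and return the best one."""
    best_tags, best_score = None, -1.0
    for tags in product(TAGS, repeat=len(words)):
        score = 1.0
        for word, tag in zip(words, tags):
            score *= UNIFORM * EMIT[tag][word]
        if score > best_score:
            best_tags, best_score = tags, score
    return list(best_tags)


def unigram_tags(words):
    """Tag each word independently with its highest-emission tag."""
    return [max(TAGS, key=lambda t: EMIT[t][w]) for w in words]
```

For the words "time flies", both functions return NN followed by VB, because the constant transition factor scales every path by the same amount.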

21. Unknown word tagging accuracy is highest when model learns:

A. Prefix/suffix distribution per POS
B. Word frequency only
C. Stopword probability
D. Character bigrams only

Correct Answer: A

This question is about how POS taggers handle unknown (out-of-vocabulary) words—words that were not seen during training.

In Part-of-Speech (POS) tagging, unknown words present a fundamental challenge—they don't appear in the training corpus, so the model cannot rely on learned word-to-tag associations. The solution lies in morphological features, particularly prefix and suffix distributions linked to grammatical categories. Morphological cues like -ly, -ness, -tion strongly correlate with POS tags.

Prefix/suffix distribution per POS

Why? Many parts of speech follow strong morphological patterns:

-tion, -ness → Noun
-ly → Adverb
-ing, -ed → Verb
un-, re-, pre- → Verbs / Adjectives

By learning which prefixes and suffixes are likely for each POS, the model can:

Infer the POS of new (unknown) words it has never seen

This is an effective and widely used approach in POS tagging: HMM taggers use it to estimate emissions for unknown words, while CRF and neural taggers use affix or character-level features for the same purpose.

Therefore, unknown word tagging accuracy is highest when the model learns prefix/suffix distributions per POS.
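A simplified sketch of the idea follows. The training words, the suffix length of 2, and the NOUN fallback are all assumptions made for illustration. Instead of a full P(suffix | tag) emission model, it simply picks the tag seen most often with the word's suffix.

```python
from collections import Counter, defaultdict

# Hypothetical tagged training words; a real tagger would use a full corpus.
TRAINING = [
    ("quickly", "ADV"), ("slowly", "ADV"), ("happily", "ADV"),
    ("nation", "NOUN"), ("station", "NOUN"), ("kindness", "NOUN"),
    ("running", "VERB"), ("walked", "VERB"), ("jumping", "VERB"),
]
SUFFIX_LEN = 2  # assumed suffix length


def train_suffix_counts(pairs, n=SUFFIX_LEN):
    """Count how often each word-final n-gram occurs with each tag."""
    counts = defaultdict(Counter)
    for word, tag in pairs:
        counts[word[-n:]][tag] += 1
    return counts


def guess_tag(word, counts, n=SUFFIX_LEN, default="NOUN"):
    """Return the tag most often seen with the word's suffix, else a default."""
    suffix = word[-n:]
    if suffix in counts:
        return counts[suffix].most_common(1)[0][0]
    return default


SUFFIX_COUNTS = train_suffix_counts(TRAINING)
```

With these toy counts, guess_tag("bravely", SUFFIX_COUNTS) returns ADV even though "bravely" never appears in the training list.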

22. Training HMM with labeled POS corpus is:

A. Unsupervised
B. Supervised
C. Reinforcement-based
D. Zero-shot

Correct Answer: B

Both words and tags are known, so probabilities are estimated directly.

23. If only words are available (no tags), HMM must be trained using:

A. Viterbi only
B. Forward-Backward (Baum-Welch)
C. MEMM
D. CRF

Correct Answer: B

Baum–Welch uses EM to estimate HMM parameters from unlabeled data, treating the tags as hidden variables.

This question tests your understanding of the three fundamental HMM problems and when to apply each algorithm.

It concerns the Learning problem, one of the three HMM problems. In the Learning problem, we are given only the observation sequence, without tags, and must estimate the model parameters using the Forward-Backward (Baum-Welch) algorithm. This is unsupervised learning.

24. Error propagation in Viterbi decoding is more likely when:

A. Transition matrix is dense
B. Emission probability is sharply peaked for rare word
C. All tags have equal probability
D. Frequent tags dominate data

Correct Answer: B

A wrong but strong emission can lock Viterbi into an incorrect path.

More information:

The question is about Viterbi decoding. Viterbi decoding is used in POS tagging and other sequence labeling tasks. In these tasks, each tag depends on the previous tag.

What is error propagation?

Viterbi scores whole tag sequences, so every tag on the winning path influences its neighbors through the transition probabilities. If the winning path contains a wrong tag, the tags around it are chosen to fit that wrong tag, and further errors can follow. This spreading of mistakes is called error propagation.

Why does error propagation happen in Viterbi algorithm?

Viterbi uses dynamic programming. At each step it keeps only the single best path into each tag and discards the rest. Path scores depend on transition probabilities (from one tag to the next) and emission probabilities (from tag to word). If the model is overconfident about a wrong tag, every path through the correct tag is heavily penalized at that position, and the error carries through the sentence.

Why option B is correct?

For a rare or unknown word, the emission estimate rests on very little training evidence, so the model may assign very high probability to one tag and very low probability to the others. If that high-probability tag is wrong, paths through the correct tag score so poorly that they lose to paths through the wrong tag. As a result, neighboring tags are forced to fit this wrong tag via the transitions.

25. Modern POS taggers outperform HMM mainly because:

A. HMM is non-probabilistic
B. Neural models capture long-context + subword info
C. HMM has no decoding algorithm
D. HMM only works on small corpora

Correct Answer: B

Neural models capture global dependencies beyond Markov assumptions.

26. High P(NN | DT) indicates:

A. Noun unlikely after determiner
B. Noun likely after determiner
C. Determiner depends on word length
D. Transition invalid

Correct Answer: B

Determiner → noun is a common English syntactic pattern.

P(NN | DT) means, "The probability that the next tag is a Noun (NN), given that the current tag is a Determiner (DT)".

High probability indicates that Determiners are usually followed by Nouns in language data. Example: "the book", "a pen", "this idea", etc.

27. Sentence probability in an HMM is:

A. Sum of emission probabilities
B. Sum of transitions
C. Sum of products of transition and emission probabilities across all possible state sequences
D. Ratio of emissions

Correct Answer: C

The probability of a single tag path is the product of its transition and emission terms; the sentence probability sums these products over all possible paths.

Sentence probability in HMM = the sum of probabilities of all possible hidden state sequences that could generate the observed sentence, where each path's probability is the product of its transition and emission probabilities.

This probabilistic framework is what makes HMMs powerful for sequence modeling in NLP and speech processing—it elegantly handles the uncertainty of hidden states while maintaining computational efficiency through dynamic programming.
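The following sketch compares the brute-force sum over all tag paths with the forward algorithm. The transition and emission values are borrowed from the Det/Noun example later on this page; the initial probabilities in INIT are an assumption added for illustration.

```python
from itertools import product

# Two-tag model; INIT is an assumed initial distribution.
TAGS = ["Det", "Noun"]
INIT = {"Det": 0.6, "Noun": 0.4}
TRANS = {"Det": {"Det": 0.1, "Noun": 0.9}, "Noun": {"Det": 0.4, "Noun": 0.6}}
EMIT = {"Det": {"the": 0.8, "cat": 0.01}, "Noun": {"the": 0.05, "cat": 0.9}}


def path_probability(words, tags):
    """Product of initial, transition, and emission terms along one tag path."""
    prob = INIT[tags[0]] * EMIT[tags[0]][words[0]]
    for i in range(1, len(words)):
        prob *= TRANS[tags[i - 1]][tags[i]] * EMIT[tags[i]][words[i]]
    return prob


def sentence_probability_brute_force(words):
    """Sum the path probability over every possible tag sequence."""
    return sum(path_probability(words, tags)
               for tags in product(TAGS, repeat=len(words)))


def sentence_probability_forward(words):
    """Same quantity via the forward algorithm (dynamic programming)."""
    alpha = {t: INIT[t] * EMIT[t][words[0]] for t in TAGS}
    for word in words[1:]:
        alpha = {t: sum(alpha[p] * TRANS[p][t] for p in TAGS) * EMIT[t][word]
                 for t in TAGS}
    return sum(alpha.values())
```

Both functions give the same value for "the cat". The brute-force version enumerates every path, while the forward version reuses partial sums, which is what keeps the computation tractable for long sentences.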

28. Viterbi backpointer table stores:

A. Loss function values
B. Most likely previous tag
C. Word embeddings
D. Vocabulary index

Correct Answer: B

Backpointers reconstruct the optimal tag sequence.

The Viterbi backpointer table is a table that stores where each best probability came from during Viterbi decoding. In simple words, it remembers which previous state (tag) gave the best path to the current state.

Why do we need a backpointer table?

Viterbi decoding has two main steps:

(1) Forward pass: compute the best probability of reaching each state at each time step, and store these probabilities in the Viterbi table.
(2) Backtracking: recover the best sequence of tags. For this, we must know which state we came from.

The backpointer table makes backtracking possible.

Simple Viterbi Example

Sentence: "Time flies"

Possible Tags:

NN (Noun)
VB (Verb)

Viterbi Probability Table

Time step	NN	VB
t = 1	0.3	0.7
t = 2	0.6	0.4

Backpointer Table

Time step	NN	VB
t = 2	VB	NN

Interpretation

Best path to NN at t = 2 came from VB at t = 1.
Best path to VB at t = 2 came from NN at t = 1.
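A minimal Viterbi sketch with an explicit backpointer table is shown below. The initial, transition, and emission values are hypothetical and do not reproduce the numbers in the tables above; they only illustrate how the backpointers are filled and then followed backwards.

```python
# Hypothetical parameters for "time flies"; not the numbers from the tables above.
TAGS = ["NN", "VB"]
INIT = {"NN": 0.7, "VB": 0.3}
TRANS = {"NN": {"NN": 0.3, "VB": 0.7}, "VB": {"NN": 0.6, "VB": 0.4}}
EMIT = {"NN": {"time": 0.5, "flies": 0.2}, "VB": {"time": 0.1, "flies": 0.6}}


def viterbi(words):
    """Return the best tag sequence and the backpointer table."""
    scores = [{t: INIT[t] * EMIT[t][words[0]] for t in TAGS}]
    backpointers = []  # one dict per step after the first: tag -> best previous tag
    for word in words[1:]:
        prev_scores, step_scores, step_back = scores[-1], {}, {}
        for tag in TAGS:
            best_prev = max(TAGS, key=lambda p: prev_scores[p] * TRANS[p][tag])
            step_back[tag] = best_prev
            step_scores[tag] = (prev_scores[best_prev] * TRANS[best_prev][tag]
                                * EMIT[tag][word])
        scores.append(step_scores)
        backpointers.append(step_back)
    # Backtrack from the best final tag, following the stored pointers.
    tag = max(TAGS, key=lambda t: scores[-1][t])
    path = [tag]
    for step_back in reversed(backpointers):
        tag = step_back[tag]
        path.append(tag)
    return path[::-1], backpointers
```

With these assumed values, viterbi(["time", "flies"]) returns the path NN, VB, and the single backpointer row records NN as the best predecessor of both tags at t = 2.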

29. Gaussian emission HMMs are preferred in speech tagging because:

A. Text is discrete
B. Speech features are continuous
C. POS-tags depend on semantics
D. They remove smoothing

Correct Answer: B

Acoustic signals are continuous-valued, well-modeled by Gaussians.

Fundamental Distinction: Discrete vs. Continuous Observations

The choice between discrete and Gaussian (continuous) HMM emission distributions depends entirely on the nature of the observations being modeled.

Discrete HMMs represent observations as discrete symbols from a finite alphabet—such as words in part-of-speech tagging or written text. When observations are discrete, emission probabilities are modeled as categorical distributions over symbol categories.

Continuous (Gaussian) HMMs represent observations as continuous-valued feature vectors. When observations are real-valued, discrete emission probabilities are not applicable; instead, the probability density is modeled using continuous distributions such as Gaussians or Gaussian Mixture Models (GMMs).

Why Gaussian Emission HMMs Fit Speech Data

In speech tagging or speech recognition, the observed data are acoustic features, not words.

Hidden Markov Models require an emission model to represent:

P(observation | state)

In text POS tagging → observations are discrete words
In speech tagging → observations are continuous feature vectors

Therefore, Gaussian (or Gaussian Mixture) distributions are ideal for modeling continuous acoustic data.

Gaussian-emission HMMs model:

P(x_t | state_t)

where x_t is a continuous acoustic feature vector.
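As a rough sketch, a Gaussian emission score for one state can be computed as below. The state names, means, and variances are hypothetical, and a diagonal covariance is assumed to keep the example short; real speech systems typically use Gaussian mixtures over many more dimensions.

```python
import math


def gaussian_log_density(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


# Hypothetical per-state parameters for 2-dimensional acoustic features.
STATES = {
    "state_a": {"mean": [0.0, 1.0], "var": [1.0, 0.5]},
    "state_b": {"mean": [3.0, -1.0], "var": [1.0, 2.0]},
}


def most_likely_state(x):
    """Pick the state whose Gaussian assigns x the highest emission density."""
    return max(
        STATES,
        key=lambda s: gaussian_log_density(x, STATES[s]["mean"], STATES[s]["var"]),
    )
```

Working in log space avoids numerical underflow when many such densities are multiplied along a path.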

30. HMM POS tagging underperforms neural models mainly because it:

A. Requires GPUs
B. Models short context only
C. Cannot generate emissions
D. Lacks training algorithms

Correct Answer: B

HMMs rely on local Markov assumptions, unlike deep contextual models.

HMMs underperform because they rely on short-context Markov assumptions, while neural models capture long-range and global linguistic information.

HMM-based POS taggers rely on the Markov assumption, typically using bigram or trigram tagging. This means the POS tag at position i depends only on a very limited local context.

P(t_i | t_i−1) or P(t_i | t_i−1, t_i−2)

In other words, Hidden Markov Models (HMMs) assume:

The current POS tag depends only on a limited number of previous tags (usually one in bigram HMMs, two in trigram HMMs).
They cannot look far ahead or far behind in a sentence.

Example:

“The book you gave me yesterday was interesting.”

To correctly tag the word “was”, the model benefits from understanding long-distance syntactic relationships. However, HMMs cannot effectively capture such long-range dependencies.

Why neural models perform better

Modern neural POS taggers such as BiLSTM, Transformer, and BERT:

Capture long-range dependencies across the sentence
Use bidirectional context (both left and right)
Learn character-level and subword features
Handle unknown and rare words more effectively

Thursday, December 11, 2025

Top 10 Advanced HMM for POS Tagging — Important MCQs (2nd Order, Trigram, Smoothing, Viterbi)

Advanced POS Tagging with Hidden Markov Models — MCQ Introduction

Master POS tagging with this focused MCQ set on Hidden Markov Models (HMM): second-order & trigram models, Viterbi decoding, Baum-Welch training, smoothing techniques, and practical tips for supervised & unsupervised learning.

Part-of-speech (POS) tagging is a cornerstone task in Natural Language Processing (NLP) that assigns grammatical categories (noun, verb, adjective, etc.) to each token in a sentence. This question set concentrates on classical statistical taggers built with Hidden Markov Models (HMMs), which model tag sequences as hidden states and observed words as emissions.

HMM-based POS taggers remain valuable for their interpretability and efficiency. They are particularly useful when you need:

Lightweight, fast taggers for resource-constrained systems
Explainable probabilistic models for linguistic analysis and teaching
Strong baselines before moving to neural models like BiLSTM or Transformer taggers

This MCQ collection targets advanced HMM concepts — including second-order (trigram) models, the Viterbi decoding algorithm, Forward–Backward / Baum–Welch for unsupervised learning, and various smoothing strategies to handle rare or unseen words. Each question includes a concise explanation to help you understand not only the correct choice but why it matters for real-world POS tagging.

What you’ll learn

How higher-order HMMs (trigram / second-order) capture broader tag context.
Why supervised training requires labeled word–tag corpora and how unsupervised EM works.
The purpose of smoothing (Laplace, Good-Turing) to avoid zero probabilities.
Trade-offs: model complexity, overfitting, and inference cost when increasing hidden states.

Use these MCQs to prepare for exams, interviews, or to evaluate your grounding before progressing to neural POS taggers. Scroll down to start the questions and test your understanding of HMM-based POS tagging fundamentals and advanced techniques.

11. A second-order POS HMM considers:

A. One previous tag
B. Two previous tags
C. No previous tag
D. All future tags

Answer: B
Explanation:

Higher-order HMM models allow P(tᵢ | tᵢ₋₁, tᵢ₋₂) improving context.

A second-order POS HMM (Hidden Markov Model) is also called a trigram HMM.

It means: P(tᵢ | tᵢ₋₁, tᵢ₋₂).

This tells us: The probability of the current tag depends on two previous tags. Therefore, the HMM "looks back" two steps in the tag sequence.

Example: Take the simple sentence "dogs chase cats". For the last word “cats”, whose tag is t₃, the two previous tags are t₂ = VERB (the tag for “chase”) and t₁ = NOUN (the tag for “dogs”), so the transition term is:

P(t₃ | t₂ = VERB, t₁ = NOUN)

12. Input for supervised POS HMM training must contain:

A. Raw sentences only
B. Word-tag annotated sentences
C. Part-of-speech dictionary
D. Dependency trees

Answer: B
Explanation:

Supervised models need labeled corpora to learn emission + transition probabilities.

Supervised POS HMM — Why labeled data is required?

A supervised POS HMM needs already labeled data so it can learn probabilities:

Transition probabilities

P(t_k | t_k-1) or P(t_k | t_k-1, t_k-2)

These require tag sequences.

Emission probabilities

P(w_k | t_k)

These require each word paired with its correct tag.

Therefore, supervised training must have sentences where every word already has a POS tag.

Example training line:

Dogs/NOUN chase/VERB cats/NOUN
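Estimating both tables from lines in this format can be sketched as follows. The two-line corpus, the lowercasing of words, and the "<s>" start marker are assumptions made for illustration; the estimates are plain relative frequencies with no smoothing.

```python
from collections import Counter, defaultdict

# Tiny hypothetical corpus in the word/TAG format shown above.
CORPUS = [
    "Dogs/NOUN chase/VERB cats/NOUN",
    "Cats/NOUN sleep/VERB",
]


def estimate(corpus):
    """Return relative-frequency transition and emission probabilities."""
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for line in corpus:
        prev = "<s>"  # assumed start-of-sentence marker
        for token in line.split():
            word, tag = token.rsplit("/", 1)
            trans_counts[prev][tag] += 1
            emit_counts[tag][word.lower()] += 1
            prev = tag
    trans = {p: {t: c / sum(cs.values()) for t, c in cs.items()}
             for p, cs in trans_counts.items()}
    emit = {t: {w: c / sum(ws.values()) for w, c in ws.items()}
            for t, ws in emit_counts.items()}
    return trans, emit


TRANS, EMIT = estimate(CORPUS)
```

On this toy corpus, EMIT["NOUN"]["cats"] comes out as 2/3 because "cats" accounts for two of the three NOUN tokens.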

13. Decoding in POS HMM refers to:

A. Parameter estimation
B. Tokenizing words
C. Selecting most likely tag sequence
D. Expanding vocabulary

Answer: C
Explanation:

Decoding maps observed words to best hidden tag sequence using Viterbi.

In a POS HMM (Hidden Markov Model), decoding means: Finding the most probable sequence of POS tags for a given sequence of words. This is usually done using the Viterbi algorithm.

So, decoding = tagging = choosing the best tag path.

14. Forward-Backward algorithm is mainly used for:

A. Viterbi decoding
B. Unsupervised HMM learning
C. POS dictionary building
D. Tokenization

Answer: B
Explanation:

It computes expected probabilities used in Baum-Welch EM training.

More information:

The Forward–Backward algorithm is the core of the Baum–Welch algorithm, which is used for: Unsupervised training of Hidden Markov Models (HMMs).

In unsupervised learning, the data has no POS tags, so the model must estimate: Transition probabilities, Emission probabilities

The Forward–Backward algorithm computes:

Forward probabilities α
Backward probabilities β
Expected counts for transitions and emissions

These expected counts are then used to re-estimate HMM parameters. This is EM (Expectation–Maximization).
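One E-step and a partial M-step can be sketched as below, under stated assumptions: a hypothetical two-state model with unnamed states, a single short observation sequence, and re-estimation of the emissions only. A full Baum-Welch implementation would also re-estimate the transitions and initial probabilities and would iterate until convergence.

```python
# Hypothetical two-state model; states are unnamed because the data has no tags.
STATES = [0, 1]
INIT = [0.5, 0.5]
TRANS = [[0.7, 0.3], [0.4, 0.6]]
EMIT = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]


def forward_backward(words):
    """Return per-position state posteriors (gamma) for one observation sequence."""
    n = len(words)
    alpha = [[0.0] * len(STATES) for _ in range(n)]
    beta = [[1.0] * len(STATES) for _ in range(n)]
    for s in STATES:
        alpha[0][s] = INIT[s] * EMIT[s][words[0]]
    for i in range(1, n):
        for s in STATES:
            alpha[i][s] = (sum(alpha[i - 1][p] * TRANS[p][s] for p in STATES)
                           * EMIT[s][words[i]])
    for i in range(n - 2, -1, -1):
        for s in STATES:
            beta[i][s] = sum(TRANS[s][q] * EMIT[q][words[i + 1]] * beta[i + 1][q]
                             for q in STATES)
    total = sum(alpha[n - 1])  # P(words), since beta at the last position is 1
    return [[alpha[i][s] * beta[i][s] / total for s in STATES] for i in range(n)]


def reestimate_emissions(words, gamma):
    """M-step for emissions only: expected word counts per state, normalized."""
    vocab = sorted(set(words))
    new_emit = []
    for s in STATES:
        denom = sum(g[s] for g in gamma)
        new_emit.append({w: sum(g[s] for g, x in zip(gamma, words) if x == w) / denom
                         for w in vocab})
    return new_emit
```

Each row of gamma is a distribution over states for one position, so it sums to 1; these soft assignments play the role that hard tag counts play in supervised training.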

15. Smoothing in HMM prevents:

A. Overtraining
B. Hidden state ambiguity
C. Viterbi path errors
D. Zero-probability transitions

Answer: D
Explanation:

Unseen tag-word or tag-tag pairs must not be zero → smoothing distributes probability.

In an HMM, probabilities are estimated from counts in the training data. If a transition or a word–tag pair never appears in training data, its probability becomes zero. This is dangerous because:

A zero probability wipes out entire Viterbi paths
The model cannot handle unseen words or unseen tag transitions

Smoothing (like Laplace, Good–Turing, Witten–Bell) adds a small nonzero probability to unseen events.

So smoothing prevents: Zero-probability transitions and zero-probability emissions.

16. A core limitation of POS HMM is that it:

A. Cannot tag new words
B. Assumes Markov + word independence
C. Requires deep networks
D. Needs semantic embeddings

Answer: B
Explanation:

HMM relies only on previous tag and assumes words depend only on tag, limiting context.

Markov + word independence - A limitation in HMM

A standard POS HMM makes two strong assumptions.

Markov Assumption (for tags): The current tag depends only on a small number of previous tags.
- This ignores long-range syntactic dependencies (e.g., subject–verb agreement across clauses).
Output Independence Assumption (for words): Words depend only on their own tag, not surrounding words.
- This ignores context that modern taggers use (e.g., CRFs, BiLSTMs, Transformers).

These assumptions simplify the model, but they also severely limit accuracy compared to modern NLP models.

17. Increasing hidden states in POS HMM generally:

A. May cause overfitting
B. Guarantees higher accuracy
C. Does nothing to model quality
D. Reduces computation

Answer: A
Explanation:

More states = more parameters → risk of overfitting & slower inference.

Increasing hidden states in POS HMM may cause overfitting

In a Hidden Markov Model (HMM) used for Part-of-Speech (POS) tagging, the "hidden states" correspond to the POS tags (like Noun, Verb, Adjective). Increasing the number of hidden states means using a more granular tagset (e.g., splitting "Noun" into "Singular Noun" and "Plural Noun") or simply increasing the model's capacity in an unsupervised setting.

Effect of increasing hidden states - Discussion

When you increase the number of states N:

You must estimate many more parameters.
But your dataset size stays the same.

So the model tries to estimate:

Many more transition probabilities (N²),
Many more emission probabilities (N × V).

With limited data, the HMM begins to:

Fit the quirks/noise of the training data,
Memorize rare patterns,
Over-specialize to word sequences it has seen,
Lose its ability to generalize to unseen text.

This phenomenon is overfitting.
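The growth in parameters is easy to quantify with a small sketch. The tagset sizes and the 10,000-word vocabulary below are hypothetical, and the count ignores initial probabilities and sum-to-one constraints for simplicity.

```python
def hmm_parameter_count(num_states, vocab_size):
    """Number of transition (N x N) plus emission (N x V) probabilities."""
    return num_states * num_states + num_states * vocab_size


# Hypothetical sizes: a 12-tag versus a 45-tag tagset over a 10,000-word vocabulary.
small = hmm_parameter_count(12, 10_000)
large = hmm_parameter_count(45, 10_000)
```

Here small is 120,144 and large is 452,025, so the larger tagset needs almost four times as many estimates from the same amount of training data.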

18. Rare words are best handled using:

A. Laplace / Good-Turing smoothing
B. Discarding them
C. Forcing one tag
D. Ignoring in training

Answer: A
Explanation:

Smoothing reallocates probability mass → better tagging for unseen/low-freq words.

19. A trigram HMM improves tagging by modeling:

A. No transition
B. Two historical tags
C. One future tag
D. Word similarity

Answer: B
Explanation:

Trigram uses P(tᵢ | tᵢ₋₁, tᵢ₋₂) → better captures context patterns.
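A minimal sketch of estimating trigram transitions from tag sequences is given below. The tag sequences and the "<s>" padding symbol are assumptions for illustration, and the estimates are unsmoothed relative frequencies.

```python
from collections import Counter

# Hypothetical tag sequences; "<s>" is an assumed start-of-sentence padding symbol.
TAG_SEQUENCES = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["DET", "NOUN", "VERB", "ADV"],
]


def trigram_transitions(sequences):
    """Relative-frequency estimate of P(t_i | t_i-2, t_i-1) from tag sequences."""
    trigrams, contexts = Counter(), Counter()
    for tags in sequences:
        padded = ["<s>", "<s>"] + tags
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            trigrams[context + (padded[i],)] += 1
            contexts[context] += 1
    return {tri: count / contexts[tri[:2]] for tri, count in trigrams.items()}


PROBS = trigram_transitions(TAG_SEQUENCES)
```

Each key is a tuple (t_i-2, t_i-1, t_i). In this toy data the context (NOUN, VERB) is followed once by DET and once by ADV, so each continuation gets probability 0.5.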

20. Unsupervised POS HMM accuracy increases with:

A. Random initialization
B. Morphological features + smoothing
C. Deleting rare words
D. Using only transitions

Answer: B
Explanation:

Morphology aids tagging without labels → suffix, prefix, capitalization rules.

Tuesday, December 2, 2025

Master HMM with MCQs – Hidden Markov Model Tagging Explained

Hidden Markov Model - MCQs - Problem-based Practice Questions

HMM-Based POS Tagging Practice

These questions explore key aspects of Hidden Markov Model (HMM) based Part-of-Speech (POS) tagging. Some questions explicitly provide prior (initial) probabilities, while others focus only on transition and emission probabilities. You will practice:

Calculating posterior probabilities for individual words.
Evaluating sequence likelihoods using transitions and emissions.
Handling unseen words with smoothing techniques.
Determining most likely tag sequences based on high-probability transitions.

1. Consider the following HMM for POS tagging:

Emission Probabilities	Transition Probabilities
P(dog | Noun) = 0.6	P(next = Noun | current = Noun) = 0.4
P(dog | Verb) = 0.1	P(next = Verb | current = Noun) = 0.6
P(runs | Noun) = 0.1	P(next = Noun | current = Verb) = 0.5
P(runs | Verb) = 0.7	P(next = Verb | current = Verb) = 0.5

In the table, 'next' and 'current' in the probability P(next = Noun | current = Noun) refer to 'POS tag of next word' and 'POS tag of current word' respectively.
Which is the most likely tag sequence for the sentence “dog runs” using the HMM?

A. Noun → Noun
B. Noun → Verb
C. Verb → Noun
D. Verb → Verb

Answer: B
Explanation:

P(Noun→Verb) = 0.6 × 0.6 × 0.7 = 0.252. Highest likelihood = Noun→Verb.

Step-by-Step Probability Computation

We compute the probability for each possible tag sequence using:

P(t₂ | t₁) × P(dog | t₁) × P(runs | t₂)

1. Sequence: Noun → Noun

P(dog | Noun) = 0.6
P(Noun | Noun) = 0.4
P(runs | Noun) = 0.1

0.6 × 0.4 × 0.1 = 0.024

2. Sequence: Noun → Verb

P(dog | Noun) = 0.6
P(Verb | Noun) = 0.6
P(runs | Verb) = 0.7

0.6 × 0.6 × 0.7 = 0.252

3. Sequence: Verb → Noun

P(dog | Verb) = 0.1
P(Noun | Verb) = 0.5
P(runs | Noun) = 0.1

0.1 × 0.5 × 0.1 = 0.005

4. Sequence: Verb → Verb

P(dog | Verb) = 0.1
P(Verb | Verb) = 0.5
P(runs | Verb) = 0.7

0.1 × 0.5 × 0.7 = 0.035

Highest Probability = 0.252

Most likely tag sequence:

B. Noun → Verb
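The four hand computations above can be reproduced with a short enumeration. As in the worked solution, no initial tag probability is used, which amounts to assuming a uniform start distribution.

```python
from itertools import product

TAGS = ["Noun", "Verb"]
EMIT = {"Noun": {"dog": 0.6, "runs": 0.1}, "Verb": {"dog": 0.1, "runs": 0.7}}
TRANS = {"Noun": {"Noun": 0.4, "Verb": 0.6}, "Verb": {"Noun": 0.5, "Verb": 0.5}}


def score_all(words):
    """Score every tag sequence as emission x transition x emission (no initial term)."""
    scores = {}
    for tags in product(TAGS, repeat=len(words)):
        score = EMIT[tags[0]][words[0]]
        for i in range(1, len(words)):
            score *= TRANS[tags[i - 1]][tags[i]] * EMIT[tags[i]][words[i]]
        scores[tags] = score
    return scores


SCORES = score_all(["dog", "runs"])
BEST = max(SCORES, key=SCORES.get)
```

BEST is ("Noun", "Verb") with a score of 0.252, matching option B. Enumeration is only practical for very short sentences; Viterbi finds the same maximum without listing every sequence.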

2. For the HMM below:

Initial tag probabilities:
• P(Noun) = 0.7
• P(Adj) = 0.3

Emission probabilities for the word "red":
• P("red" | Noun) = 0.2
• P("red" | Adj) = 0.8

Calculate the normalized probability that 'red' is tagged as an Adjective.

A. 0.14
B. 0.56
C. 0.63
D. 0.24

Answer: C
Explanation:

Compute unnormalized scores:
• Adj = 0.3 × 0.8 = 0.24
• Noun = 0.7 × 0.2 = 0.14

Normalize to get posterior:
Adj = 0.24 / (0.24 + 0.14) ≈ 0.63.

Calculate the probability of a specific tag assignment for a single observed word

In a Hidden Markov Model (HMM), the probability of a specific tag assignment for a single observed word is calculated using the Joint Probability of the tag and the word. This is often referred to as the "Viterbi score" or "path probability" for that specific tag.

The formula for the joint probability of a single state (tag) and observation (word) is:

P(Tag,Word) = P(Tag) × P(Word∣Tag)

Where:

P(Tag) is the Initial tag probability (Prior).
P(Word∣Tag) is the Emission probability (Likelihood).

Step 1: Calculate the Joint Probabilities for All Tags

For each possible tag, compute the product of the initial probability and emission probability:

For tag = Adjective (Adj):

P(Adj,"red") = P(Adj) × P("red"∣Adj) = 0.3 × 0.8 = 0.24

For tag = Noun:

P(Noun,"red") = P(Noun) × P("red"∣Noun) = 0.7 × 0.2 = 0.14

Step 2: Calculate the Normalizing Constant (Total Probability)

The normalizing constant is the sum of all joint probabilities:

P("red") = P(Adj,"red") + P(Noun,"red") = 0.24 + 0.14 = 0.38

Step 3: Apply Bayes' Theorem to Get the Posterior Probability

Using the normalization formula:

P(Adj∣"red") = P(Adj,"red") / P("red") = 0.24 / 0.38

Simplifying:

P(Adj∣"red") = 0.24 / 0.38 ≈ 0.6316 or approximately 63.16%

Final Answer

The normalized probability that 'red' is tagged as an Adjective is:

P(Adj∣"red") = 0.24 / 0.38 ≈ 0.632 or 63.2%
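The three steps can be wrapped in a small helper, sketched here with the numbers from this question. The function name tag_posteriors is hypothetical.

```python
def tag_posteriors(priors, emissions):
    """Normalize prior x emission scores into P(tag | word)."""
    joint = {tag: priors[tag] * emissions[tag] for tag in priors}
    total = sum(joint.values())
    return {tag: score / total for tag, score in joint.items()}


red = tag_posteriors({"Noun": 0.7, "Adj": 0.3}, {"Noun": 0.2, "Adj": 0.8})
```

red["Adj"] evaluates to 0.24 / 0.38, about 0.632, and the two posteriors sum to 1.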

3. Using the HMM:

Transition	Det	Noun
Det	0.1	0.9
Noun	0.4	0.6

Emission P(word | tag)	Det	Noun
"the"	0.8	0.05
"cat"	0.01	0.9

Most likely tagging for "the cat" is:

A. Det → Det
B. Det → Noun
C. Noun → Det
D. Noun → Noun

Answer: B
Explanation:

Solution: Let's solve this step by step using the Hidden Markov Model (HMM)

The goal is to find the most likely sequence of tags for the sentence "the cat" using the Viterbi principle.

Step 1: Understand the tables

Transition probabilities (P(tag₂ | tag₁)):

From\To	Det	Noun
Det	0.1	0.9
Noun	0.4	0.6

For example, if the previous tag is Det, the probability that the next tag is Noun is 0.9.

Emission probabilities (P(word | tag)):

Word	Det	Noun
the	0.8	0.05
cat	0.01	0.9

For example, the probability that the word "cat" is emitted by a Noun is 0.9.

Step 2: Compute joint probabilities for all sequences

We consider all possible tag sequences for "the cat":

Det → Det

P(Det → Det) = transition × emission
Step 1 (first word "the" as Det): P("the"|Det) = 0.8
Step 2 (second word "cat" as Det): transition P(Det|Det) = 0.1, emission P("cat"|Det) = 0.01

Total probability = 0.8 × 0.1 × 0.01 = 0.0008

Det → Noun

Step 1 "the" as Det: P("the"|Det) = 0.8
Step 2 "cat" as Noun: transition P(Noun|Det) = 0.9, emission P("cat"|Noun) = 0.9

Total probability = 0.8 × 0.9 × 0.9 = 0.648

Noun → Det

Step 1 "the" as Noun: P("the"|Noun) = 0.05
Step 2 "cat" as Det: transition P(Det|Noun) = 0.4, emission P("cat"|Det) = 0.01

Total probability = 0.05 × 0.4 × 0.01 = 0.0002

Noun → Noun

Step 1 "the" as Noun: P("the"|Noun) = 0.05
Step 2 "cat" as Noun: transition P(Noun|Noun) = 0.6, emission P("cat"|Noun) = 0.9

Total probability = 0.05 × 0.6 × 0.9 = 0.027

Step 3: Compare probabilities

Sequence	Probability
Det → Det	0.0008
Det → Noun	0.648
Noun → Det	0.0002
Noun → Noun	0.027

Step 4: Most likely tagging

The most likely sequence is: Det → Noun (Option B)

"the" is a determiner (Det), and "cat" is a noun (Noun)

4. Given the emission matrix:

Word → Tag	Verb	Adv
"quickly"	0.2	0.7

If P(Verb)=0.5 and P(Adv)=0.5 initially, probability the word "quickly" is tagged Adv:

A. 0.41
B. 0.55
C. 0.78
D. 0.64

Answer: C
Explanation:

P(Tag,Word) = P(Tag) × P(Word∣Tag)

If "quickly" is tagged as Verb: 0.5×0.2=0.10;

If "quickly" is tagged as Adv: 0.5×0.7=0.35.

Highest is Adv. Hence, normalized Adv = 0.35/(0.10+0.35) = 0.35/0.45 ≈ 0.78.

5. A word appears 10 times as Noun and 2 times as Verb in training. Without smoothing, P(Noun | word) = ?

A. 0.2
B. 0.5
C. 0.83
D. 0.91

Answer: C
Explanation:

P(Noun | word) means "Out of all the times the word occurs, how often does it occur with the tag Noun?". We are not doing any smoothing, just using raw counts.

The word appears 10 times as Noun

The same word appears 2 times as Verb

Total appearances of the word = 10 + 2 = 12

Since we want P(Noun | word):

P(Noun | word) = Count(word with Noun) / Total count of the word

Substitute the values

P(Noun | word) = 10 / 12 = 0.8333

Rounded: 0.83

Note: this is the tag posterior for the word, not the emission probability P(word | Noun). The emission probability divides by the total number of Noun-tagged tokens instead, which the question does not give.

6. In an HMM POS-tagger, we want to estimate the emission probability of an unseen word. Consider the word "glorf", which never occurred in the training data.
For the tag Noun, the training corpus contains:

Total noun-tagged word tokens = 50
Count of "glorf" = 0
Vocabulary size (unique words) = 10

Using Add-1 (Laplace) smoothing, compute 𝑃("glorf" ∣ Noun).

A. 1/60
B. 1/51
C. 1/61
D. 51/61

Answer: A
Explanation:

Laplace smoothing → (0+1)/(50 + 10) = 1/60.

Understanding the question

We have an unseen word: "glorf"
That means in the training data count("glorf" | Noun) = 0.
We want to compute P("glorf" | Noun) using Add-1 (Laplace) smoothing.

✅ Given

Total noun tokens	50
Count of "glorf" under Noun	0
Vocabulary size (V)	10

Add-1 smoothing formula

P(w | tag) = (count(w, tag) + 1) / (total tokens under tag + V)

Step-by-step calculation

P("glorf" | Noun) = (0 + 1) / (50 + 10) = 1 / 60
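The same calculation as a short sketch, using exact fractions so the result can be compared directly with the answer options. The function name add_one_emission is hypothetical.

```python
from fractions import Fraction


def add_one_emission(word_tag_count, tag_total, vocab_size):
    """Add-1 (Laplace) estimate of P(word | tag)."""
    return Fraction(word_tag_count + 1, tag_total + vocab_size)


glorf_given_noun = add_one_emission(0, 50, 10)
```

glorf_given_noun equals 1/60. A word seen 5 times under Noun would get (5 + 1) / 60 = 1/10 with the same counts, so seen words give up a little probability mass to unseen ones.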

7. Which sentence has lower HMM likelihood given high Verb→Noun transition?

A. eat food
B. food eat

Answer: B
Explanation:

"food eat" requires the tag sequence Noun → Verb, which does not match the high-probability Verb → Noun transition that the HMM expects, so it receives the lower likelihood.

"eat food" (Verb -> Noun) has HIGH HMM likelihood

"food eat" (Noun -> Verb) has LOW HMM likelihood

8. Given partial Viterbi table:

t	word	best tag	prob
1	fish	Noun	0.52
2	swim	Verb	0.46

Assume the HMM has a strong Verb → Noun transition (i.e., P(Noun|Verb) is high).

Model predicts next tag likely:

A. Noun
B. Verb
C. Both equal
D. Cannot determine

Answer: A
Explanation:

Since the best tag at t=2 is Verb, the predicted next tag depends mainly on the transition probabilities from Verb. The question explicitly states that Verb → Noun transition is strong. Therefore, the HMM expects the next tag to be Noun with highest probability.

Why the Viterbi algorithm predicts Noun as the next tag

With no information given about the next word's emission probabilities, the Viterbi algorithm will favor Noun as the next tag because:

High transition probability: P(Noun|Verb) is high, which raises the score of any path that continues from Verb to Noun.
Natural language patterns: Verbs commonly take noun objects in English (for example, "swim laps"), so Verb → Noun sequences are frequent in training data.
Viterbi maximization: The algorithm keeps the tag sequence with the maximum accumulated probability. With a strong Verb → Noun transition, the Noun path will typically have a higher accumulated probability than the alternatives.

The strong transition probability from Verb to Noun makes this the most likely prediction for the next tag in the sequence.
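The reasoning above can be sketched as a single Viterbi step in Python. Only the score 0.46 for Verb at t=2 comes from the table. The transition values, the uniform emission values, and the helper name `viterbi_step` are hypothetical and chosen purely for illustration.

```python
# Best-path score at t=2, taken from the partial Viterbi table.
prev_score = {"Verb": 0.46}

# Hypothetical transition probabilities: a strong Verb -> Noun transition.
trans = {"Verb": {"Noun": 0.7, "Verb": 0.1}}

# Hypothetical emissions: the next word is assumed equally likely under both tags,
# so the transition probabilities alone decide the outcome.
emit_next = {"Noun": 0.5, "Verb": 0.5}


def viterbi_step(prev_score, trans, emit):
    """Score each candidate tag at t+1 and record its best predecessor."""
    scores, backptr = {}, {}
    for tag, emission in emit.items():
        best_prev = max(prev_score, key=lambda p: prev_score[p] * trans[p][tag])
        scores[tag] = prev_score[best_prev] * trans[best_prev][tag] * emission
        backptr[tag] = best_prev
    return scores, backptr


scores, backptr = viterbi_step(prev_score, trans, emit_next)
best_tag = max(scores, key=scores.get)
print(best_tag, backptr[best_tag])  # Noun Verb
```

With a real next word, a sharply peaked emission probability for another tag could outweigh the transition, which is why the explanation says the Noun path "typically" wins.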

9. In an HMM for POS tagging, you are given the following transition probabilities for adjectives:

An adjective is followed by a noun with probability 0.75
An adjective is followed by another adjective with probability 0.10

These probabilities tell us which tags usually come after an adjective in the training data.

Using only these transition probabilities, which 2-word phrase does the HMM consider more likely?

A. beautiful red
B. beautiful flower
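A minimal Python sketch of the comparison, using only the two transition probabilities given in the question. The tag assignments ("beautiful" and "red" as adjectives, "flower" as a noun) are assumptions made for illustration, and emission probabilities are ignored as the question instructs.

```python
# Transition probabilities out of the adjective tag, as given in the question.
trans_from_adj = {"NOUN": 0.75, "ADJ": 0.10}

# Assumed tag of the second word in each phrase ("beautiful" is ADJ in both).
second_tag = {"beautiful red": "ADJ", "beautiful flower": "NOUN"}

# Score each phrase by the transition from ADJ to its second tag.
scores = {phrase: trans_from_adj[tag] for phrase, tag in second_tag.items()}
print(max(scores, key=scores.get))  # beautiful flower
```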

10. In an HMM POS tagger, you observe the single word "cat". The model gives you the following probabilities:

Tag Transition	Probability
DT → NN	0.8
DT → VB	0.2

Emission	"cat"
NN emits "cat"	0.7
VB emits "cat"	0.1

For this one-word sentence, the tag is chosen mainly based on the emission probability of the word. Based on these values, which tag is the HMM most likely to assign to the word "cat"?

A. DT
B. NN
C. VB
D. Cannot determine
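The two tables can be combined in a short Python sketch that scores each candidate tag as transition × emission. The numbers come from the question. Treating DT as the previous tag is an assumption implied by the "DT → ..." rows of the transition table.

```python
# Values from the tables in the question; the previous tag DT is assumed.
trans_from_dt = {"NN": 0.8, "VB": 0.2}
emit_cat = {"NN": 0.7, "VB": 0.1}

# Score for each candidate tag: P(tag | DT) * P("cat" | tag).
scores = {tag: trans_from_dt[tag] * emit_cat[tag] for tag in trans_from_dt}
best_tag = max(scores, key=scores.get)

print({tag: round(score, 2) for tag, score in scores.items()})  # {'NN': 0.56, 'VB': 0.02}
print(best_tag)  # NN
```

Here the emission and the transition both favor the same tag, so the result is the same whether the decision uses the emission alone or the full product.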


Monday, January 5, 2026

HMM POS Tagging MCQs (Advanced) | Viterbi, Baum-Welch & NLP Concepts


Wednesday, December 17, 2025

Hidden Markov Model (HMM) – MCQs, Notes & Practice Questions | ExploreDatabase


Thursday, December 11, 2025

Top 10 Advanced HMM for POS Tagging — Important MCQs (2nd Order, Trigram, Smoothing, Viterbi)


Tuesday, December 2, 2025

Master HMM with MCQs – Hidden Markov Model Tagging Explained

