Which feature improves unknown word tagging accuracy the most in HMM POS tagging?

Learning prefix and suffix distributions per POS tag improves accuracy by leveraging morphological cues.

Training an HMM with labeled POS data is classified as what type of learning?

It is supervised learning since both words and POS tags are known during training.

Which algorithm is used to train an HMM when POS tags are not available?

The Forward-Backward (Baum-Welch) algorithm is used for unsupervised HMM training.

When is error propagation more likely in Viterbi decoding?

When a rare word has a sharply peaked but incorrect emission probability, causing early path commitment.

Why do modern POS taggers outperform HMM-based models?

Neural models capture long-range context and subword features beyond the Markov assumption.

What does a high P(NN | DT) transition probability indicate?

It indicates that nouns are likely to follow determiners, which is common in English syntax.

How is sentence probability computed in an HMM?

It is computed as the product of transition probabilities and emission probabilities across the sentence.

What information is stored in the Viterbi backpointer table?

It stores the most likely previous tag for reconstructing the optimal tag sequence.

Why are Gaussian emission HMMs preferred in speech tagging?

Because speech features are continuous-valued, making Gaussian distributions suitable.

HMM POS Tagging MCQs (Advanced) | Viterbi, Baum-Welch & NLP Concepts

Q: If transitions are uniform/random, HMM POS tagger becomes?

It becomes equivalent to a unigram emission-based selector, relying only on P(word | tag).

✔ Scroll down and test yourself — answers are hidden under the “View Answer” button.

🚨 Quiz Instructions:
Attempt all questions first.
✔️ Click SUBMIT at the end to unlock VIEW ANSWER buttons.

20. If transitions are uniform/random, HMM POS tagger becomes:

A. Fully deterministic
B. Distribution-free classifier
C. Equivalent to unigram emission selector
D. Memory-based classifier

Correct Answer: C

With uniform transitions, tagging depends only on P(word|tag), i.e., emission probabilities.

With uniform transitions, an HMM POS tagger reduces to a unigram model that tags each word independently using emission probabilities only.

Step-by-step Explanation

An HMM POS tagger assigns part-of-speech tags using two probabilities:

Transition probability - P(t_i | t_i−1)
→ How likely a tag follows the previous tag

Emission probability - P(w_i | t_i)
→ How likely a word is generated by a tag

During decoding using the Viterbi algorithm, the model maximizes:

P(t_i | t_i−1) × P(w_i | t_i)

What does uniform / random transitions mean?

Uniform transitions imply:

P(t_i | t_i−1) = constant for all tag pairs

Transition probabilities do not prefer any particular tag sequence
They contribute the same value for every possible path

Therefore, transition probabilities no longer influence the tagging decision.

What remains?

Only the emission probabilities matter:

arg max_{t_i} P(w_i | t_i)

This is exactly what a unigram POS tagger does:

Assigns each word the tag with the highest emission probability
Ignores contextual information entirely

21. Unknown word tagging accuracy is highest when model learns:

A. Prefix/suffix distribution per POS
B. Word frequency only
C. Stopword probability
D. Character bigrams only

Correct Answer: A

This question is about how POS taggers handle unknown (out-of-vocabulary) words—words that were not seen during training.

In Part-of-Speech (POS) tagging, unknown words present a fundamental challenge—they don't appear in the training corpus, so the model cannot rely on learned word-to-tag associations. The solution lies in morphological features, particularly prefix and suffix distributions linked to grammatical categories. Morphological cues like -ly, -ness, -tion strongly correlate with POS tags.

Prefix/suffix distribution per POS

Why? Many parts of speech follow strong morphological patterns:

-tion, -ness → Noun
-ly → Adverb
-ing, -ed → Verb
un-, re-, pre- → Verbs / Adjectives

By learning which prefixes and suffixes are likely for each POS, the model can:

Infer the POS of new (unknown) words it has never seen

This is the most effective and widely used approach in POS tagging models such as HMMs, CRFs, and neural taggers.

Therefore, unknown word tagging accuracy is highest when the model learns prefix/suffix distributions per POS.

22. Training HMM with labeled POS corpus is:

A. Unsupervised
B. Supervised
C. Reinforcement-based
D. Zero-shot

Correct Answer: B

Both words and tags are known, so probabilities are estimated directly.

23. If only words are available (no tags), HMM must be trained using:

A. Viterbi only
B. Forward-Backward (Baum-Welch)
C. MEMM
D. CRF

Correct Answer: B

Baum–Welch uses EM to estimate hidden states from unlabeled data.

This question tests your understanding of the three fundamental HMM problems and when to apply each algorithm.

This question asks about the Learning problem of three HMM problems. As per the Learning problem, we are given only the observation sequence without tags, and we need to find the model parameters with the help of Forward-Backward (Baum-Welch) algorithm. This is unsupervised learning.

24. Error propagation in Viterbi decoding is more likely when:

A. Transition matrix is dense
B. Emission probability is sharply peaked for rare word
C. All tags have equal probability
D. Frequent tags dominate data

Correct Answer: B

A wrong but strong emission can lock Viterbi into an incorrect path.

More information:

The question is about Viterbi decoding. Viterbi decoding is used in POS tagging and other sequence labeling tasks. In these tasks, each tag depends on the previous tag.

What is error propagation?

In Viterbi decoding: The algorithm selects the best path step by step. If a wrong tag is chosen early, that wrong choice affects the next tags. As a result, more errors occur later. This spreading of mistakes is called error propagation.

Why does error propagation happen in Viterbi algorithm?

Viterbi uses dynamic programming. It keeps only one best path, not many alternatives. It depends on: Transition probabilities (from one tag to the next) and Emission probabilities (tag to word). If the model is too confident about a wrong tag, the error continues through the sentence.

Why option B is correct?

For a rare or unknown word, the model assigns: Very high probability to one tag and Very low probability to others. If that high-probability tag is wrong, Viterbi commits strongly to it and the alternative paths are discarded. As a result, future tags are forced to follow this wrong tag via transitions

25. Modern POS taggers outperform HMM mainly because:

A. HMM is non-probabilistic
B. Neural models capture long-context + subword info
C. HMM has no decoding algorithm
D. HMM only works on small corpora

Correct Answer: B

Neural models capture global dependencies beyond Markov assumptions.

26. High P(NN | DT) indicates:

A. Noun unlikely after determiner
B. Noun likely after determiner
C. Determiner depends on word length
D. Transition invalid

Correct Answer: B

Determiner → noun is a common English syntactic pattern.

P(NN | DT) means, "The probability that the next tag is a Noun (NN), given that the current tag is a Determiner (DT)".

High probability indicates that Determiners are usually followed by Nouns in language data. Example: "the book", "a pen", "this idea", etc.

27. Sentence probability in an HMM is:

A. Sum of emission probabilities
B. Sum of transitions
C. Sum of products of transition and emission probabilities across all possible state sequences
D. Ratio of emissions

Correct Answer: C

HMM probability is the product over all transition and emission terms.

Sentence probability in HMM = the sum of probabilities of all possible hidden state sequences that could generate the observed sentence, where each path's probability is the product of its transition and emission probabilities.

This probabilistic framework is what makes HMMs powerful for sequence modeling in NLP and speech processing—it elegantly handles the uncertainty of hidden states while maintaining computational efficiency through dynamic programming.

28. Viterbi backpointer table stores:

A. Loss function values
B. Most likely previous tag
C. Word embeddings
D. Vocabulary index

Correct Answer: B

Backpointers reconstruct the optimal tag sequence.

The Viterbi backpointer table is a table that stores where each best probability came from during Viterbi decoding. In simple words, it remembers which previous state (tag) gave the best path to the current state.

Why do we need a backpointer table?

Viterbi decoding has two main steps: (1) Forward pass (Compute the best probability of reaching each state at each time step, Store these probabilities in the Viterbi table.), (2) Backtracking (Recover the best sequence of tags, For this, we must know which state we came from). The backpointer table makes backtracking possible.

Simple Viterbi Example

Sentence: "Time flies"

Possible Tags:

NN (Noun)
VB (Verb)

Viterbi Probability Table

Time	NN	VB
t = 1	0.3	0.7
t = 2	0.6	0.4

Backpointer Table

Time	NN	VB
t = 2	VB	NN

Interpretation

Best path to NN at t = 2 came from VB at t = 1.
Best path to VB at t = 2 came from NN at t = 1.

29. Gaussian emission HMMs are preferred in speech tagging because:

A. Text is discrete
B. Speech features are continuous
C. POS-tags depend on semantics
D. They remove smoothing

Correct Answer: B

Acoustic signals are continuous-valued, well-modeled by Gaussians.

Fundamental Distinction: Discrete vs. Continuous Observations

The choice between discrete and Gaussian (continuous) HMM emission distributions depends entirely on the nature of the observations being modeled.

Discrete HMMs represent observations as discrete symbols from a finite alphabet—such as words in part-of-speech tagging or written text. When observations are discrete, emission probabilities are modeled as categorical distributions over symbol categories.

Continuous (Gaussian) HMMs represent observations as continuous-valued feature vectors. When observations are real-valued, discrete emission probabilities are not applicable; instead, the probability density is modeled using continuous distributions such as Gaussians or Gaussian Mixture Models (GMMs).

Why Gaussian Emission HMMs Fit Speech Data

In speech tagging or speech recognition, the observed data are acoustic features, not words.

Hidden Markov Models require an emission model to represent:

P(observation | state)

In text POS tagging → observations are discrete words
In speech tagging → observations are continuous feature vectors

Therefore, Gaussian (or Gaussian Mixture) distributions are ideal for modeling continuous acoustic data.

Gaussian-emission HMMs model:

P(x_t | state_t)

where x_t is a continuous acoustic feature vector.

30. HMM POS tagging underperforms neural models mainly because it:

A. Requires GPUs
B. Models short context only
C. Cannot generate emissions
D. Lacks training algorithms

Correct Answer: B

HMMs rely on local Markov assumptions, unlike deep contextual models.

HMMs underperform because they rely on short-context Markov assumptions, while neural models capture long-range and global linguistic information.

HMM-based POS taggers rely on the Markov assumption, typically using bigram or trigram tagging. This means the POS tag at position t depends only on a very limited local context.

P(t_t | t_t−1) or P(t_t | t_t−1, t_t−2)

In other words, Hidden Markov Models (HMMs) assume:

The current POS tag depends only on a limited number of previous tags (usually one in bigram HMMs, two in trigram HMMs).
They cannot look far ahead or far behind in a sentence.

Example:

“The book you gave me yesterday was interesting.”

To correctly tag the word “was”, the model benefits from understanding long-distance syntactic relationships. However, HMMs cannot effectively capture such long-range dependencies.

Why neural models perform better

Modern neural POS taggers such as BiLSTM, Transformer, and BERT:

Capture long-range dependencies across the sentence
Use bidirectional context (both left and right)
Learn character-level and subword features
Handle unknown and rare words more effectively

Major links

Quicklinks

Monday, January 5, 2026