Advanced POS Tagging with Hidden Markov Models — MCQ Introduction
Master POS tagging with this focused MCQ set on Hidden Markov Models (HMM): second-order & trigram models, Viterbi decoding, Baum-Welch training, smoothing techniques, and practical tips for supervised & unsupervised learning.
Part-of-speech (POS) tagging is a cornerstone task in Natural Language Processing (NLP) that assigns grammatical categories (noun, verb, adjective, etc.) to each token in a sentence. This question set concentrates on classical statistical taggers built with Hidden Markov Models (HMMs), which model tag sequences as hidden states and observed words as emissions.
HMM-based POS taggers remain valuable for their interpretability and efficiency. They are particularly useful when you need:
- Lightweight, fast taggers for resource-constrained systems
- Explainable probabilistic models for linguistic analysis and teaching
- Strong baselines before moving to neural models like BiLSTM or Transformer taggers
This MCQ collection targets advanced HMM concepts — including second-order (trigram) models, the Viterbi decoding algorithm, Forward–Backward / Baum–Welch for unsupervised learning, and various smoothing strategies to handle rare or unseen words. Each question includes a concise explanation to help you understand not only the correct choice but why it matters for real-world POS tagging.
What you’ll learn
- How higher-order HMMs (trigram / second-order) capture broader tag context.
- Why supervised training requires labeled word–tag corpora and how unsupervised EM works.
- The purpose of smoothing (Laplace, Good-Turing) to avoid zero probabilities.
- Trade-offs: model complexity, overfitting, and inference cost when increasing hidden states.
Use these MCQs to prepare for exams, interviews, or to evaluate your grounding before progressing to neural POS taggers. Scroll down to start the questions and test your understanding of HMM-based POS tagging fundamentals and advanced techniques.
Q1. A second-order (trigram) HMM POS tagger conditions the current tag on:
A. One previous tag
B. Two previous tags
C. No previous tag
D. All future tags
Answer: B. Two previous tags
Explanation:
Higher-order HMM models allow P(tᵢ | tᵢ₋₁, tᵢ₋₂), improving context modeling.
A second-order POS HMM (Hidden Markov Model) is also called a trigram HMM.
It means: P(tᵢ | tᵢ₋₁, tᵢ₋₂). This tells us that the probability of the current tag depends on the two previous tags, so the HMM "looks back" two steps in the tag sequence.
Example: Take the simple sentence "dogs chase cats". For the last word "cats", whose tag is t₃, given the previous two tags t₁ = NOUN (for "dogs") and t₂ = VERB (for "chase"), you would write P(t₃ = NOUN | t₂ = VERB, t₁ = NOUN).
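To make the trigram transition probability concrete, here is a minimal sketch (not from the original post) that estimates P(tᵢ | tᵢ₋₁, tᵢ₋₂) from tag counts; the toy corpus and function names are illustrative assumptions.

```python
from collections import defaultdict

# Toy tagged corpus: each sentence is a list of (word, tag) pairs (illustrative data).
corpus = [
    [("dogs", "NOUN"), ("chase", "VERB"), ("cats", "NOUN")],
    [("cats", "NOUN"), ("sleep", "VERB")],
]

trigram_counts = defaultdict(int)   # counts of (t_{i-2}, t_{i-1}, t_i)
bigram_counts = defaultdict(int)    # counts of (t_{i-2}, t_{i-1})

for sentence in corpus:
    # Pad with start symbols so the first real tag also has two predecessors.
    tags = ["<s>", "<s>"] + [tag for _, tag in sentence]
    for i in range(2, len(tags)):
        trigram_counts[(tags[i-2], tags[i-1], tags[i])] += 1
        bigram_counts[(tags[i-2], tags[i-1])] += 1

def trigram_prob(t_prev2, t_prev1, t_curr):
    """Maximum-likelihood estimate of P(t_curr | t_prev1, t_prev2)."""
    denom = bigram_counts[(t_prev2, t_prev1)]
    return trigram_counts[(t_prev2, t_prev1, t_curr)] / denom if denom else 0.0

print(trigram_prob("NOUN", "VERB", "NOUN"))  # P(NOUN | VERB, NOUN) from the toy counts
```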
Q2. What kind of training data does a supervised HMM POS tagger require?
A. Raw sentences only
B. Word-tag annotated sentences
C. Part-of-speech dictionary
D. Dependency trees
Answer: B. Word-tag annotated sentences
Explanation:
Supervised models need labeled corpora to learn emission + transition probabilities.
Supervised POS HMM — why labeled data is required
A supervised POS HMM needs already-labeled data so it can learn two kinds of probabilities:
- Transition probabilities: these require tag sequences.
- Emission probabilities: these require each word paired with its correct tag.
Therefore, supervised training must have sentences where every word already has a POS tag, for example:
Dogs/NOUN chase/VERB cats/NOUN
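As an illustration of what is learned from such data, here is a minimal sketch (not from the original post) that derives maximum-likelihood transition and emission estimates from a toy tagged corpus; the data and variable names are illustrative assumptions.

```python
from collections import Counter

# Toy labeled corpus in word/TAG form, as in the example above (illustrative data).
tagged_sentences = [[("Dogs", "NOUN"), ("chase", "VERB"), ("cats", "NOUN")]]

transition_counts = Counter()  # (previous_tag, current_tag)
emission_counts = Counter()    # (tag, word)
tag_counts = Counter()

for sentence in tagged_sentences:
    prev_tag = "<s>"
    for word, tag in sentence:
        transition_counts[(prev_tag, tag)] += 1
        emission_counts[(tag, word.lower())] += 1
        tag_counts[tag] += 1
        prev_tag = tag

# Maximum-likelihood estimates (no smoothing yet).
p_transition = {  # P(current_tag | previous_tag)
    (prev, curr): c / sum(v for (p, _), v in transition_counts.items() if p == prev)
    for (prev, curr), c in transition_counts.items()
}
p_emission = {  # P(word | tag)
    (tag, word): c / tag_counts[tag]
    for (tag, word), c in emission_counts.items()
}
print(p_emission[("NOUN", "cats")])  # 0.5: NOUN emits "cats" once out of two NOUN tokens
```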
Q3. In an HMM POS tagger, what does "decoding" refer to?
A. Parameter estimation
B. Tokenizing words
C. Selecting most likely tag sequence
D. Expanding vocabulary
Answer: C. Selecting most likely tag sequence
Explanation:
Decoding maps observed words to best hidden tag sequence using Viterbi.
In a POS HMM (Hidden Markov Model), decoding means: Finding the most probable sequence of POS tags for a given sequence of words. This is usually done using the Viterbi algorithm.
So, decoding = tagging = choosing the best tag path.
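Since the Viterbi algorithm is central here, the following is a minimal sketch (not from the original post) of Viterbi decoding in log space; the function name and probability-table format are illustrative assumptions.

```python
import math

def viterbi(words, tags, p_trans, p_emit, p_start):
    """Minimal Viterbi decoder: returns the most probable tag sequence for `words`.

    p_start[t], p_trans[(t_prev, t)], p_emit[(t, w)] are probability lookups;
    missing entries are treated as a tiny floor value to avoid log(0).
    """
    floor = 1e-12
    def logp(table, key):
        return math.log(table.get(key, floor))

    # best[i][t] = best log-probability of any tag path ending in tag t at position i
    best = [{t: logp(p_start, t) + logp(p_emit, (t, words[0])) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev_scores = {tp: best[i-1][tp] + logp(p_trans, (tp, t)) for tp in tags}
            tp_best = max(prev_scores, key=prev_scores.get)
            best[i][t] = prev_scores[tp_best] + logp(p_emit, (t, words[i]))
            back[i][t] = tp_best

    # Follow back-pointers from the best final tag.
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```

Working in log space with a small floor for missing entries avoids numerical underflow and sidesteps the zero-probability issue discussed in the smoothing questions below.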
Q4. The Forward–Backward algorithm is primarily associated with which task?
A. Viterbi decoding
B. Unsupervised HMM learning
C. POS dictionary building
D. Tokenization
Answer: B. Unsupervised HMM learning
Explanation:
It computes the posterior probabilities (expected counts) used in Baum–Welch EM training.
More information: The Forward–Backward algorithm is the core of the Baum–Welch algorithm, which is used for unsupervised training of Hidden Markov Models (HMMs).
In unsupervised learning, the data has no POS tags, so the model must estimate the transition probabilities and emission probabilities on its own.
The Forward–Backward algorithm computes:
- Forward probabilities α
- Backward probabilities β
- Expected counts for transitions and emissions
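As a rough illustration of these quantities, here is a minimal NumPy sketch (not from the original post) of one Forward–Backward pass; the matrix layout (pi, A, B) is an assumed convention, and a real Baum–Welch implementation would add scaling to avoid underflow.

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    """Minimal Forward–Backward pass for a discrete HMM.

    obs: sequence of observation indices; pi: initial tag distribution (N,);
    A: transition matrix (N, N) with A[i, j] = P(tag j | tag i);
    B: emission matrix (N, V) with B[i, o] = P(word o | tag i).
    Returns alpha, beta, and gamma (posterior tag probabilities per position),
    which supply the expected counts that Baum–Welch re-estimation needs.
    """
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))

    # Forward pass: alpha[t, j] = P(obs[0..t], state_t = j)
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

    # Backward pass: beta[t, i] = P(obs[t+1..T-1] | state_t = i)
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)  # posterior P(state_t = i | obs)
    return alpha, beta, gamma
```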
Q5. Smoothing in an HMM POS tagger is primarily used to prevent which problem?
A. Overtraining
B. Hidden state ambiguity
C. Viterbi path errors
D. Zero-probability transitions
Answer: D. Zero-probability transitions
Explanation:
Probabilities of unseen tag-word or tag-tag pairs must not be zero → smoothing distributes probability mass to them.
In an HMM, probabilities are estimated from counts in the training data. If a transition or a word–tag pair never appears in training data, its probability becomes zero. This is dangerous because:
- A zero probability wipes out entire Viterbi paths
- The model cannot handle unseen words or unseen tag transitions
Smoothing (like Laplace, Good–Turing, Witten–Bell) adds a small nonzero probability to unseen events.
So smoothing prevents: Zero-probability transitions and zero-probability emissions.
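For instance, a minimal add-alpha (Laplace) smoothing sketch for transition probabilities might look like the following; the function, counts, and tagset are illustrative assumptions rather than a complete implementation.

```python
def laplace_transition_prob(counts, prev_tag, curr_tag, tagset, alpha=1.0):
    """Add-alpha (Laplace) smoothed estimate of P(curr_tag | prev_tag).

    counts[(prev, curr)] holds raw bigram tag counts; unseen transitions get a
    small nonzero probability instead of zero, so no Viterbi path is wiped out.
    """
    numer = counts.get((prev_tag, curr_tag), 0) + alpha
    denom = sum(counts.get((prev_tag, t), 0) for t in tagset) + alpha * len(tagset)
    return numer / denom

# Toy counts (illustrative): the transition DET -> VERB never occurs in training.
counts = {("DET", "NOUN"): 9, ("DET", "ADJ"): 1}
tagset = ["DET", "NOUN", "VERB", "ADJ"]
print(laplace_transition_prob(counts, "DET", "VERB", tagset))  # small but nonzero (1/14)
```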
Q6. What is a fundamental limitation of HMM-based POS taggers?
A. Cannot tag new words
B. Assumes Markov + word independence
C. Requires deep networks
D. Needs semantic embeddings
Answer: B. Assumes Markov + word independence
Explanation:
HMM relies only on previous tag and assumes words depend only on tag, limiting context.
Markov + word independence - a limitation of HMMs
A standard POS HMM makes two strong assumptions.
- Markov Assumption (for tags): The current tag depends only on a small number of previous tags.
- This ignores long-range syntactic dependencies (e.g., subject–verb agreement across clauses).
- Output Independence Assumption (for words): Words depend only on their own tag, not surrounding words.
- This ignores context that modern taggers use (e.g., CRFs, BiLSTMs, Transformers).
These assumptions simplify the model, but they also severely limit accuracy compared to modern NLP models.
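To see how these two assumptions show up in the model, here is a minimal sketch (an illustration, not from the original post) of the joint probability a bigram HMM assigns to a tagged sentence; the function name and probability-table format are assumptions.

```python
import math

def hmm_log_joint(words, tags, p_start, p_trans, p_emit):
    """log P(words, tags) under a bigram HMM:
    P(w, t) = P(t1) * P(w1|t1) * prod_i P(t_i | t_{i-1}) * P(w_i | t_i).
    Both independence assumptions are visible: there are no word-word terms,
    and the transition term only looks one tag back."""
    logp = math.log(p_start[tags[0]]) + math.log(p_emit[(tags[0], words[0])])
    for i in range(1, len(words)):
        logp += math.log(p_trans[(tags[i-1], tags[i])])   # Markov assumption
        logp += math.log(p_emit[(tags[i], words[i])])     # output independence
    return logp
```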
Q7. What is the effect of increasing the number of hidden states in a POS HMM?
A. May cause overfitting
B. Guarantees higher accuracy
C. Does nothing to model quality
D. Reduces computation
Answer: A. May cause overfitting
Explanation:
More states = more parameters → risk of overfitting & slower inference.
Increasing the number of hidden states in a POS HMM may cause overfitting
In a Hidden Markov Model (HMM) used for Part-of-Speech (POS) tagging, the "hidden states" correspond to the POS tags (like Noun, Verb, Adjective). Increasing the number of hidden states means using a more granular tagset (e.g., splitting "Noun" into "Singular Noun" and "Plural Noun") or simply increasing the model's capacity in an unsupervised setting.
Effect of increasing hidden states - Discussion
When you increase the number of states N:
- You must estimate many more parameters.
- But your dataset size stays the same.
So the model tries to estimate:
- Many more transition probabilities (N²),
- Many more emission probabilities (N × V).
With limited data, the HMM begins to:
- Fit the quirks/noise of the training data,
- Memorize rare patterns,
- Over-specialize to word sequences it has seen,
- Lose its ability to generalize to unseen text.
This phenomenon is overfitting.
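To put rough numbers on this, here is a tiny back-of-the-envelope sketch; the tagset and vocabulary sizes are illustrative assumptions, not figures from the original post.

```python
# Rough parameter count for a bigram HMM tagger (illustrative numbers).
N = 45        # e.g., a Penn Treebank-sized tagset (assumed for illustration)
V = 50_000    # vocabulary size (assumed for illustration)

transition_params = N * N      # P(t_i | t_{i-1}) for every tag pair
emission_params = N * V        # P(w | t) for every tag-word pair

print(transition_params)       # 2,025
print(emission_params)         # 2,250,000

# Doubling the tagset roughly quadruples the transition table and doubles the
# emission table, while the training corpus stays the same size -> overfitting risk.
print((2 * N) ** 2, 2 * N * V)  # 8,100 and 4,500,000
```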
Q8. How should rare or unseen words be handled in an HMM POS tagger?
A. Laplace / Good-Turing smoothing
B. Discarding them
C. Forcing one tag
D. Ignoring in training
Answer: A. Laplace / Good-Turing smoothing
Explanation:
Smoothing reallocates probability mass → better tagging for unseen/low-freq words.
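As a complement to the Laplace example above, here is a minimal sketch (illustrative, not from the original post) of the simple Good-Turing idea: the total probability mass reserved for unseen words under a tag is estimated as N₁/N, the fraction of tokens whose word type was seen exactly once.

```python
from collections import Counter

def unseen_mass_good_turing(word_counts):
    """Simple Good-Turing estimate of the total probability mass to reserve
    for unseen events: N1 / N, where N1 = number of types seen exactly once
    and N = total number of observed tokens."""
    n_total = sum(word_counts.values())
    n_once = sum(1 for c in word_counts.values() if c == 1)
    return n_once / n_total if n_total else 0.0

# Toy emission counts for a single tag (illustrative data).
noun_counts = Counter({"cats": 3, "dogs": 2, "ideas": 1, "quarks": 1})
print(unseen_mass_good_turing(noun_counts))  # 2/7 of the mass goes to unseen nouns
```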
Q9. What does a trigram HMM condition the current tag on?
A. No transition
B. Two historical tags
C. One future tag
D. Word similarity
Answer: B. Two historical tags
Explanation:
Trigram uses P(tᵢ | tᵢ₋₁, tᵢ₋₂) → better captures context patterns.
Q10. What helps an unsupervised HMM POS tagger achieve better accuracy without labeled data?
A. Random initialization
B. Morphological features + smoothing
C. Deleting rare words
D. Using only transitions
Answer: B. Morphological features + smoothing
Explanation:
Morphology aids tagging without labels → suffix, prefix, capitalization rules.
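To illustrate, here is a minimal sketch (not from the original post) of the kind of morphological cues the explanation mentions; the suffix rules and tag names are illustrative assumptions, not a complete system.

```python
def morphological_tag_guess(word):
    """Guess a coarse POS tag for an unknown word from surface cues
    (capitalization, digits, suffixes). Rules and tags are illustrative only."""
    if word[0].isupper():
        return "PROPN"          # capitalized mid-sentence -> likely proper noun
    if any(ch.isdigit() for ch in word):
        return "NUM"
    if word.endswith(("ing", "ed")):
        return "VERB"
    if word.endswith("ly"):
        return "ADV"
    if word.endswith(("tion", "ness", "ment", "s")):
        return "NOUN"
    return "NOUN"               # default fallback for open-class words

for w in ["Googleplex", "running", "quickly", "automation", "42nd"]:
    print(w, morphological_tag_guess(w))
```

In practice such cues are usually folded into the emission model (e.g., as a back-off distribution for unknown words) rather than applied as hard rules.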