
Sunday, January 25, 2026

Shallow Parsing in NLP – Top 10 MCQs with Answers (Chunking)

✔ Scroll down and test yourself — answers are hidden under the “View Answer” button.

🚨 Quiz Instructions:
Attempt all questions first.
✔️ Click SUBMIT at the end to unlock VIEW ANSWER buttons.

Introduction

Shallow parsing, also known as chunking, is a foundational technique in Natural Language Processing (NLP) that focuses on identifying flat, non-recursive phrase structures such as noun phrases, verb phrases, and prepositional phrases from POS-tagged text. Unlike deep parsing, which attempts to build complete syntactic trees, shallow parsing prioritizes efficiency, robustness, and scalability, making it a preferred choice in large-scale NLP pipelines.

This MCQ set is designed to test both conceptual understanding and implementation-level knowledge of shallow parsing. The questions cover key aspects including design philosophy, chunk properties, finite-state models (FSA and FST), BIO tagging schemes, and statistical sequence labeling approaches such as Conditional Random Fields (CRFs). These questions are particularly useful for students studying NLP, Computational Linguistics, Information Retrieval, and AI, as well as for exam preparation and interview revision.

Try to reason through each question before revealing the answer to strengthen your understanding of how shallow parsing operates in theory and practice.


1.
Which statement best captures the primary design philosophy of shallow parsing?






Correct Answer: C

Shallow parsing trades depth and linguistic completeness for efficiency and robustness.

Shallow parsing (chunking) is designed to identify basic phrases such as noun phrases (NP) and verb phrases (VP), to avoid recursion and nesting, and to keep the analysis fast, simple, and robust.

Because of this design choice, shallow parsing scales well to large corpora, works better with noisy or imperfect POS tagging, and is practical for real-world NLP pipelines (IR, IE, preprocessing).

2.
Why is shallow parsing preferred over deep parsing in large-scale NLP pipelines?






Correct Answer: C

Shallow parsing is preferred over deep parsing because it is computationally faster and more robust to noise while providing sufficient structural information for many NLP tasks.

Shallow parsing is preferred mainly because it is faster, simpler, and more robust, especially in real-world NLP systems. The main reasons are:

  • Computational efficiency: shallow parsing works with local patterns over POS tags and avoids building full syntactic trees, so it is much faster and uses less memory than deep parsing.
  • Robustness to noisy data: shallow parsing tolerates errors because it matches short, local tag sequences.
  • Scalability: it is suitable for large-scale text processing.
  • Lower resource requirements: shallow parsing can be implemented with finite-state automata, regular expressions, and sequence labeling models (e.g., CRFs).

For more information, visit

Shallow parsing (chunking) vs. Deep parsing

3.
The phrase patterns used in shallow parsing are most appropriately modeled as:






Correct Answer: B

Phrase patterns in shallow parsing are best modeled as regular expressions / regular languages because chunking is local, linear, non-recursive, and non-overlapping. All of these properties fit exactly within the expressive power of regular languages.

Why are the phrase patterns used in shallow parsing modeled as regular expressions / regular languages?

1. Shallow parsing works on POS tag sequences, not full syntax. In chunking, we usually operate on sequences like "DT JJ JJ NN VBZ DT NN" and define patterns such as "NP → DT? JJ* NN+". This is pattern matching over a flat sequence, not hierarchical structure building. That is exactly what regular expressions are designed for.

2. Chunk patterns are non-recursive. Regular languages cannot express recursion, and shallow parsing intentionally avoids recursion (no nested constituents). For example, "[NP the [NP quick brown fox]]" is not allowed in shallow parsing.

3. Chunks are non-overlapping. Each word belongs to at most one chunk. Example: "[NP the dog] [VP chased] [NP the cat]". There is no crossing or embedding like: "*[NP the dog chased] [NP the cat]". This strict linear segmentation matches the finite-state assumption. Since recursion is forbidden by design, CFG power is unnecessary.
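To make this concrete, here is a minimal sketch (the tag sequence and pattern are illustrative, not from this post) that expresses NP → DT? JJ* NN+ with Python's re module over an encoded POS-tag sequence:

import re

# POS tags of: "the quick brown fox chased a cat"
tags = ["DT", "JJ", "JJ", "NN", "VBD", "DT", "NN"]

# Encode the sequence as "<DT><JJ>..." so every tag is a clearly delimited token.
encoded = "".join(f"<{t}>" for t in tags)

# NP -> DT? JJ* NN+ written as a regular expression over the encoded tags.
np_pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NN>)+")

for m in np_pattern.finditer(encoded):
    print("NP chunk over tags:", m.group())
# NP chunk over tags: <DT><JJ><JJ><NN>
# NP chunk over tags: <DT><NN>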

4.
Which automaton is suitable for recognizing chunk patterns in rule-based shallow parsing over POS-tagged text?






Correct Answer: B

Why is a deterministic finite-state automaton (FSA) suitable for recognizing chunk patterns in rule-based shallow parsing over POS-tagged text?

Chunk patterns in shallow parsing are regular and flat, so they can be efficiently recognized using a finite state automaton.

In rule-based shallow parsing (chunking), the goal is to recognize flat phrase patterns (such as noun phrases or verb phrases) in a linear sequence of POS tags, for example "DT JJ NN VBZ DT NN".

Chunk patterns are defined using regular expressions like "NP → DT? JJ* NN+".

Such patterns belong to the class of regular languages, which can be recognized by a finite state automaton (FSA). Therefore, a deterministic finite state automaton (FSA) is suitable for recognizing chunk patterns in rule-based shallow parsing. More powerful automata like pushdown automata or Turing machines are unnecessary because shallow parsing does not require recursion or unbounded memory.
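In practice, such rule-based chunkers compile the regular-expression grammar into a finite-state matcher. A short sketch using NLTK's RegexpParser (assuming NLTK is installed; the grammar and sentence are illustrative):

import nltk

# One chunk rule: NP -> optional determiner, any number of adjectives, one or more nouns.
grammar = "NP: {<DT>?<JJ>*<NN>+}"
chunker = nltk.RegexpParser(grammar)

# Already POS-tagged input as (word, tag) pairs.
sentence = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"),
            ("fox", "NN"), ("chased", "VBD"), ("the", "DT"), ("cat", "NN")]

tree = chunker.parse(sentence)
print(tree)
# Roughly: (S (NP the/DT quick/JJ brown/JJ fox/NN) chased/VBD (NP the/DT cat/NN))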

5.
Why are finite-state transducers (FSTs) sometimes preferred over FSAs in shallow parsing?






Correct Answer: B

Finite-state transducers (FSTs) are sometimes preferred over finite-state automata (FSAs) in shallow parsing because they can both recognize patterns and produce output labels, whereas FSAs can only recognize whether a pattern matches.

In shallow parsing, the task is not just to detect that a sequence of POS tags forms a chunk, but also to label the chunk boundaries, such as assigning NP, VP, or BIO tags (B-NP, I-NP, O). An FST maps an input POS-tag sequence to an output sequence with chunk labels or brackets, making it well suited for this purpose.

Since shallow parsing involves flat, non-recursive, and local patterns, the power of finite-state models is sufficient. Using an FST adds practical usefulness by enabling annotation and transformation, while retaining the efficiency and simplicity of finite-state processing.
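As a hedged illustration of this recognize-and-annotate behaviour, the tiny hand-written transducer below (my own sketch, not from the post) reads POS tags and emits BIO chunk labels at the same time:

NP_TAGS = {"DT", "JJ", "NN"}   # tags allowed inside our toy NP chunks

def chunk_transduce(pos_tags):
    """Map a POS-tag sequence to BIO labels with a two-state machine.

    States: "OUT" (outside any NP) and "IN" (inside an NP). The output at each
    step depends only on the current state and input tag, which is exactly the
    behaviour of a finite-state transducer.
    """
    state, output = "OUT", []
    for tag in pos_tags:
        if tag not in NP_TAGS:
            output.append("O")
            state = "OUT"
        elif state == "OUT" or tag == "DT":   # a determiner always opens a new NP
            output.append("B-NP")
            state = "IN"
        else:
            output.append("I-NP")
    return output

print(chunk_transduce(["DT", "JJ", "NN", "VBD", "DT", "NN"]))
# ['B-NP', 'I-NP', 'I-NP', 'O', 'B-NP', 'I-NP']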

6.
In the BIO chunk tagging scheme, the tag B-NP indicates:






Correct Answer: B

BIO chunk tagging scheme in shallow parsing - short notes

The BIO chunk tagging scheme is a commonly used method in shallow parsing (chunking) to label phrase boundaries in a sequence of tokens.

BIO stands for:

  • B (Begin) – marks the first word of a chunk
  • I (Inside) – marks words inside the same chunk
  • O (Outside) – marks words that are not part of any chunk

Each B and I tag is usually combined with a chunk type, such as NP (noun phrase) or VP (verb phrase).

Example:

The   quick  brown  fox   jumps
B-NP  I-NP   I-NP   I-NP  B-VP

The BIO tagging scheme represents flat, non-overlapping chunks, avoids hierarchical or nested structures, and converts chunking into a sequence labeling problem. Due to its simplicity and clarity, it is widely used in rule-based, statistical, and neural-network-based shallow parsing systems.
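Going the other way, converting BIO labels back into chunk spans is a small mechanical step. The helper below is an illustrative sketch (the function name is my own) using the example sentence above:

def bio_to_chunks(tokens, bio_tags):
    """Group tokens into (chunk_type, words) spans from BIO labels."""
    chunks, current = [], None
    for word, tag in zip(tokens, bio_tags):
        if tag.startswith("B-"):                                  # a new chunk starts here
            current = (tag[2:], [word])
            chunks.append(current)
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(word)                               # continue the open chunk
        else:                                                     # "O" or an inconsistent I- tag
            current = None
    return chunks

tokens = ["The", "quick", "brown", "fox", "jumps"]
tags = ["B-NP", "I-NP", "I-NP", "I-NP", "B-VP"]
print(bio_to_chunks(tokens, tags))
# [('NP', ['The', 'quick', 'brown', 'fox']), ('VP', ['jumps'])]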

7.
Which property must hold for chunks produced by shallow parsing?






8.
When shallow parsing is formulated as a sequence labeling problem, which probabilistic model is commonly used?






Correct Answer: C

What is Conditional Random Field (CRF)?

A CRF (Conditional Random Field) is a probabilistic, discriminative model used for sequence labeling tasks in machine learning and natural language processing.

A Conditional Random Field models the probability of a label sequence given an input sequence, i.e., P(Y | X), where X is the observation sequence and Y is the corresponding label sequence.

What are CRFs used for?

CRFs are commonly used in NLP tasks such as shallow parsing (chunking), named entity recognition (NER), part-of-speech tagging, and information extraction.

Why is a CRF used for shallow parsing?

Conditional Random Fields (CRFs) are used for shallow parsing because shallow parsing is naturally a sequence labeling problem, and CRFs are designed to model dependencies between neighboring labels in a sequence.
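On the implementation side, here is a minimal sketch of CRF-based chunking, assuming the third-party sklearn-crfsuite package is installed; the features and the two-sentence training "corpus" are purely illustrative:

import sklearn_crfsuite

def token_features(sent, i):
    """Simple per-token features: the word, its POS tag, and the previous POS tag."""
    word, pos = sent[i]
    return {
        "word.lower": word.lower(),
        "pos": pos,
        "prev_pos": sent[i - 1][1] if i > 0 else "BOS",
    }

# Tiny illustrative training data: (word, POS) pairs with BIO chunk labels.
train_sents = [
    [("the", "DT"), ("dog", "NN"), ("chased", "VBD"), ("the", "DT"), ("cat", "NN")],
    [("a", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ")],
]
train_labels = [
    ["B-NP", "I-NP", "O", "B-NP", "I-NP"],
    ["B-NP", "I-NP", "I-NP", "O"],
]

X_train = [[token_features(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X_train, train_labels)

test = [("the", "DT"), ("brown", "JJ"), ("dog", "NN"), ("sleeps", "VBZ")]
print(crf.predict([[token_features(test, i) for i in range(len(test))]]))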

9.
Shallow parsing is less sensitive to POS tagging errors than deep parsing because:






Correct Answer: C

10.
Which of the following tasks lies just beyond the scope of shallow parsing?






Correct Answer: C

Monday, January 5, 2026

HMM POS Tagging MCQs (Advanced) | Viterbi, Baum-Welch & NLP Concepts

✔ Scroll down and test yourself — answers are hidden under the “View Answer” button.

🚨 Quiz Instructions:
Attempt all questions first.
✔️ Click SUBMIT at the end to unlock VIEW ANSWER buttons.
20. If transitions are uniform/random, HMM POS tagger becomes:






Correct Answer: C

With uniform transitions, tagging depends only on P(word|tag), i.e., emission probabilities.

With uniform transitions, an HMM POS tagger reduces to a unigram model that tags each word independently using emission probabilities only.

Step-by-step Explanation

An HMM POS tagger assigns part-of-speech tags using two probabilities:

  1. Transition probability - P(ti | ti−1)
    → How likely a tag follows the previous tag

  2. Emission probability - P(wi | ti)
    → How likely a word is generated by a tag

During decoding with the Viterbi algorithm, the model maximizes the product over all positions i of:

P(ti | ti−1) × P(wi | ti)

What does uniform / random transitions mean?

Uniform transitions imply:

P(ti | ti−1) = constant for all tag pairs

  • Transition probabilities do not prefer any particular tag sequence
  • They contribute the same value for every possible path

Therefore, transition probabilities no longer influence the tagging decision.

What remains?

Only the emission probabilities matter:

arg max over ti of P(wi | ti)

This is exactly what a unigram POS tagger does:

  • Assigns each word the tag with the highest emission probability
  • Ignores contextual information entirely
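A small sketch of this reduction (the emission table is invented for illustration): with uniform transitions, decoding collapses to an independent arg max over emission probabilities for each word.

# Toy emission table P(word | tag); all values are made up for illustration.
emission = {
    "NN": {"time": 0.4, "flies": 0.3},
    "VB": {"time": 0.1, "flies": 0.5},
}

def unigram_tag(words):
    """With uniform transitions, pick arg max_t P(word | t) independently per word."""
    return [max(emission, key=lambda tag: emission[tag].get(w, 0.0)) for w in words]

print(unigram_tag(["time", "flies"]))   # ['NN', 'VB']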
21. Unknown word tagging accuracy is highest when model learns:






Correct Answer: A

This question is about how POS taggers handle unknown (out-of-vocabulary) words—words that were not seen during training.

In Part-of-Speech (POS) tagging, unknown words present a fundamental challenge—they don't appear in the training corpus, so the model cannot rely on learned word-to-tag associations. The solution lies in morphological features, particularly prefix and suffix distributions linked to grammatical categories. Morphological cues like -ly, -ness, -tion strongly correlate with POS tags.


Prefix/suffix distribution per POS

Why? Many parts of speech follow strong morphological patterns:
  • -tion, -ness → Noun
  • -ly → Adverb
  • -ing, -ed → Verb
  • un-, re-, pre- → Verbs / Adjectives
By learning which prefixes and suffixes are likely for each POS, the model can:
  • Infer the POS of new (unknown) words it has never seen
This is the most effective and widely used approach in POS tagging models such as HMMs, CRFs, and neural taggers.

Therefore, unknown word tagging accuracy is highest when the model learns prefix/suffix distributions per POS.
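A rough sketch of the idea (the suffix list and tags are illustrative, not a trained model):

# Illustrative suffix -> most likely POS tag mapping, as might be learned
# from suffix distributions in training data.
SUFFIX_TAGS = [("tion", "NN"), ("ness", "NN"), ("ly", "RB"), ("ing", "VBG"), ("ed", "VBD")]

def guess_unknown_tag(word, default="NN"):
    """Guess a POS tag for an out-of-vocabulary word from its suffix."""
    for suffix, tag in SUFFIX_TAGS:
        if word.lower().endswith(suffix):
            return tag
    return default   # nouns are a common fallback for unknown words

print(guess_unknown_tag("globalization"))  # NN
print(guess_unknown_tag("quickly"))        # RB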
22. Training HMM with labeled POS corpus is:






Correct Answer: B

Both words and tags are known, so probabilities are estimated directly.

23. If only words are available (no tags), HMM must be trained using:






Correct Answer: B

Baum–Welch uses EM to estimate the model parameters (transition and emission probabilities) from unlabeled data.

This question tests your understanding of the three fundamental HMM problems and when to apply each algorithm.

This question concerns the learning problem, one of the three fundamental HMM problems: given only the observation sequence (no tags), we estimate the model parameters with the Forward–Backward (Baum–Welch) algorithm. This is unsupervised learning.

24. Error propagation in Viterbi decoding is more likely when:






Correct Answer: B

A wrong but strong emission can lock Viterbi into an incorrect path.

More information:

The question is about Viterbi decoding. Viterbi decoding is used in POS tagging and other sequence labeling tasks. In these tasks, each tag depends on the previous tag.

What is error propagation?

In Viterbi decoding: The algorithm selects the best path step by step. If a wrong tag is chosen early, that wrong choice affects the next tags. As a result, more errors occur later. This spreading of mistakes is called error propagation.

Why does error propagation happen in the Viterbi algorithm?

Viterbi uses dynamic programming and keeps only one best path, not many alternatives. It depends on transition probabilities (from one tag to the next) and emission probabilities (tag to word). If the model is too confident about a wrong tag, the error continues through the sentence.

Why is option B correct?

For a rare or unknown word, the model may assign a very high probability to one tag and very low probabilities to the others. If that high-probability tag is wrong, Viterbi commits strongly to it and the alternative paths are discarded. As a result, later tags are forced to follow the wrong tag via the transition probabilities.

25. Modern POS taggers outperform HMM mainly because:






Correct Answer: B

Neural models capture global dependencies beyond Markov assumptions.

26. High P(NN | DT) indicates:






Correct Answer: B

Determiner → noun is a common English syntactic pattern.

P(NN | DT) means "the probability that the next tag is a noun (NN), given that the current tag is a determiner (DT)".

A high value indicates that determiners are usually followed by nouns in language data. Examples: "the book", "a pen", "this idea", etc.

27. Sentence probability in an HMM is:






Correct Answer: C

The probability of a single tag path is the product of its transition and emission terms.

Sentence probability in an HMM is the sum, over all possible hidden state sequences that could generate the observed sentence, of each path's probability, where a path's probability is the product of its transition and emission probabilities.

This probabilistic framework is what makes HMMs powerful for sequence modeling in NLP and speech processing—it elegantly handles the uncertainty of hidden states while maintaining computational efficiency through dynamic programming.
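To make "sum over all paths" concrete, here is a small sketch of the forward algorithm for a toy discrete HMM (all probabilities are invented for illustration); it computes exactly this sum with dynamic programming:

import numpy as np

# Toy HMM with tags NN, VB and words "time" (index 0), "flies" (index 1).
pi = np.array([0.6, 0.4])          # P(first tag)
A  = np.array([[0.3, 0.7],         # A[i, j] = P(tag_j | tag_i)
               [0.8, 0.2]])
B  = np.array([[0.4, 0.3],         # B[i, k] = P(word_k | tag_i)
               [0.1, 0.5]])
obs = [0, 1]                       # word indices of "time flies"

def forward(obs, pi, A, B):
    """Sentence probability = sum over all tag paths of (transitions x emissions)."""
    alpha = pi * B[:, obs[0]]           # initialise with the first word
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # extend every partial path by one step
    return alpha.sum()                  # marginalise over the final tag

print(forward(obs, pi, A, B))      # P("time flies") under the toy HMM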

28. Viterbi backpointer table stores:






Correct Answer: B

Backpointers reconstruct the optimal tag sequence.

The Viterbi backpointer table is a table that stores where each best probability came from during Viterbi decoding. In simple words, it remembers which previous state (tag) gave the best path to the current state.

Why do we need a backpointer table?

Viterbi decoding has two main steps: (1) forward pass: compute the best probability of reaching each state at each time step and store these probabilities in the Viterbi table; (2) backtracking: recover the best sequence of tags, for which we must know which state we came from. The backpointer table makes backtracking possible.

Simple Viterbi Example

Sentence: "Time flies"

Possible Tags:

  • NN (Noun)
  • VB (Verb)

Viterbi Probability Table

Time     NN     VB
t = 1    0.3    0.7
t = 2    0.6    0.4

Backpointer Table

Time     NN     VB
t = 2    VB     NN

Interpretation

  • Best path to NN at t = 2 came from VB at t = 1.
  • Best path to VB at t = 2 came from NN at t = 1.
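A compact sketch of Viterbi decoding with an explicit backpointer table (it reuses the toy HMM from the forward-algorithm sketch above; the numbers are independent of the illustrative table in this question):

import numpy as np

tags = ["NN", "VB"]
pi = np.array([0.6, 0.4])          # initial tag probabilities (illustrative)
A  = np.array([[0.3, 0.7],         # A[i, j] = P(tag_j | tag_i)
               [0.8, 0.2]])
B  = np.array([[0.4, 0.3],         # B[i, k] = P(word_k | tag_i)
               [0.1, 0.5]])
obs = [0, 1]                       # word indices of "time flies"

def viterbi(obs, pi, A, B):
    T, N = len(obs), len(pi)
    v = np.zeros((T, N))                   # best path probability ending in each tag
    back = np.zeros((T, N), dtype=int)     # backpointer: index of the best previous tag
    v[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = v[t - 1][:, None] * A * B[:, obs[t]]   # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)                 # remember the best predecessor
        v[t] = scores.max(axis=0)
    # Backtracking: follow backpointers from the best final tag.
    path = [int(v[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [tags[i] for i in reversed(path)]

print(viterbi(obs, pi, A, B))      # most probable tag sequence for "time flies"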
29. Gaussian emission HMMs are preferred in speech tagging because:






Correct Answer: B

Acoustic signals are continuous-valued, well-modeled by Gaussians.

Fundamental Distinction: Discrete vs. Continuous Observations

The choice between discrete and Gaussian (continuous) HMM emission distributions depends entirely on the nature of the observations being modeled.

Discrete HMMs represent observations as discrete symbols from a finite alphabet—such as words in part-of-speech tagging or written text. When observations are discrete, emission probabilities are modeled as categorical distributions over symbol categories.

Continuous (Gaussian) HMMs represent observations as continuous-valued feature vectors. When observations are real-valued, discrete emission probabilities are not applicable; instead, the probability density is modeled using continuous distributions such as Gaussians or Gaussian Mixture Models (GMMs).

Why Gaussian Emission HMMs Fit Speech Data

In speech tagging or speech recognition, the observed data are acoustic features, not words.

Hidden Markov Models require an emission model to represent:

P(observation | state)

  • In text POS tagging → observations are discrete words
  • In speech tagging → observations are continuous feature vectors

Therefore, Gaussian (or Gaussian Mixture) distributions are ideal for modeling continuous acoustic data.

Gaussian-emission HMMs model:

P(xt | statet)

where xt is a continuous acoustic feature vector.
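A tiny sketch of evaluating a Gaussian emission density for one hidden state, assuming SciPy is available (the mean, covariance, and feature vector are invented):

import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 3-dimensional acoustic feature vector observed at time t.
x_t = np.array([1.2, -0.3, 0.8])

# Gaussian emission parameters for one hidden state (invented numbers).
mean = np.array([1.0, 0.0, 1.0])
cov  = np.diag([0.5, 0.5, 0.5])

# P(x_t | state) is a probability *density*, not a categorical probability.
print(multivariate_normal.pdf(x_t, mean=mean, cov=cov))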

30. HMM POS tagging underperforms neural models mainly because it:






Correct Answer: B

HMMs rely on local Markov assumptions, unlike deep contextual models.

HMMs underperform because they rely on short-context Markov assumptions, while neural models capture long-range and global linguistic information.

HMM-based POS taggers rely on the Markov assumption, typically using bigram or trigram tagging. This means the POS tag at position t depends only on a very limited local context.

P(tt | tt−1)   or   P(tt | tt−1, tt−2)

In other words, Hidden Markov Models (HMMs) assume:

  • The current POS tag depends only on a limited number of previous tags (usually one in bigram HMMs, two in trigram HMMs).
  • They cannot look far ahead or far behind in a sentence.

Example:

“The book you gave me yesterday was interesting.”

To correctly tag the word “was”, the model benefits from understanding long-distance syntactic relationships. However, HMMs cannot effectively capture such long-range dependencies.

Why neural models perform better

Modern neural POS taggers such as BiLSTM, Transformer, and BERT:

  • Capture long-range dependencies across the sentence
  • Use bidirectional context (both left and right)
  • Learn character-level and subword features
  • Handle unknown and rare words more effectively

Wednesday, December 17, 2025

Hidden Markov Model (HMM) – MCQs, Notes & Practice Questions | ExploreDatabase

✔ Scroll down and test yourself — answers are hidden under the “View Answer” button.

🚨 Quiz Instructions:
Attempt all questions first.
✔️ Click SUBMIT at the end to unlock VIEW ANSWER buttons.

Key Concepts Illustrated in the Figure

  1. Visible states (Observations)
    Visible states are the observed outputs of an HMM, such as words in a sentence. In the above figure, 'cat', 'purrs', etc. are observations.
  2. Hidden states
    Hidden states are the unobserved underlying states (e.g., POS tags - 'DT', 'N', etc in the figure) that generate the visible observations.
  3. Transition probabilities
    Transition probabilities define the likelihood of moving from one hidden state to another. In the figure, this is represented by the arrows from one POS tag to the other. Example: P(N -> V) or P(V | N).
  4. Emission probabilities
    Emission probabilities define the likelihood of a visible observation being generated by a hidden state. In the figure, this is represented by the arrows from POS tags to words. Example: P(cat | N).
  5. POS tagging using HMM
    POS tagging using HMM models tags as hidden states and words as observations to find the most probable tag sequence.
  6. Evaluation problem
    The evaluation problem computes the probability of an observation sequence given an HMM.
  7. Forward algorithm
    The forward algorithm efficiently solves the evaluation problem using dynamic programming.
  8. Decoding problem
    The decoding problem finds the most probable hidden state sequence for a given observation sequence.
1. In POS tagging using HMM, the hidden states represent:






Correct Answer: B

In HMM-based POS tagging, tags are hidden states and words are observed symbols.

2. The most suitable algorithm for decoding the best POS sequence in HMM tagging is:






Correct Answer: D

Viterbi decoding finds the most probable hidden tag sequence.

3. Transition probabilities in HMM POS tagging define:






Correct Answer: B

Transition probability models tag-to-tag dependency, that is, the probability of a tag at position t given the previous tag at position t−1. It is calculated using Maximum Likelihood Estimation (MLE) as follows:

Maximum Likelihood Estimation (MLE)

When the state sequence is known (for example, in POS tagging with labeled training data), the transition probability is estimated using Maximum Likelihood Estimation.

aij = Count(ti → tj) / Count(ti)

Where:

  • Count(ti → tj) is the number of times a POS tag ti is immediately followed by a POS tag tj in the training data.
  • Count(ti) is the total number of occurrences of tag ti in the entire training data.

This estimation ensures that the transition probabilities for each state sum to 1.

For example, the transition probability P(Noun | Det) is 6/10 = 0.6 if, in the training corpus, the tag sequence "Det Noun" (e.g., "The/Det cat/Noun" in tagged training data) occurs 6 times and the tag "Det" appears 10 times overall.
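A short sketch of this MLE computation on a tiny tagged corpus (the corpus is invented for illustration):

from collections import Counter

# Tiny tagged corpus: one tag sequence per sentence (illustrative).
tag_sequences = [
    ["DT", "NN", "VBZ"],
    ["DT", "JJ", "NN", "VBD", "DT", "NN"],
]

tag_counts = Counter()
bigram_counts = Counter()
for seq in tag_sequences:
    tag_counts.update(seq)
    bigram_counts.update(zip(seq, seq[1:]))     # count ti -> tj pairs

def transition_prob(prev_tag, tag):
    """MLE estimate: Count(prev_tag -> tag) / Count(prev_tag)."""
    return bigram_counts[(prev_tag, tag)] / tag_counts[prev_tag]

print(transition_prob("DT", "NN"))   # 2 occurrences of DT -> NN out of 3 DT tags = 0.666...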

4. Emission probability in POS tagging refers to:






Correct Answer: C

Emission probability is P(word | tag).

It answers the question: "Given a particular POS tag, how likely is it that this tag generates (emits) a specific word?"

Emission probability calculation: out of the total number of times a tag (e.g., NOUN) appears in the training data, count how many times it is assigned to a given word (e.g., "cat/NOUN") and divide.
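Analogously, a sketch of the emission-probability estimate from tagged data (the mini-corpus is invented):

from collections import Counter

# Tagged corpus as (word, tag) pairs (illustrative).
tagged_words = [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ"),
                ("a", "DT"), ("cat", "NN"), ("purrs", "VBZ"), ("dog", "NN")]

tag_counts = Counter(tag for _, tag in tagged_words)
word_tag_counts = Counter(tagged_words)

def emission_prob(word, tag):
    """MLE estimate: Count(word tagged as tag) / Count(tag)."""
    return word_tag_counts[(word, tag)] / tag_counts[tag]

print(emission_prob("cat", "NN"))   # 2 of the 3 NN tokens are "cat" -> 2/3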

5. Which problem does Baum–Welch training solve in HMM POS tagging?






Correct Answer: C

Baum–Welch (EM) learns transition and emission probabilities without labeled data.

Baum–Welch Method

The Baum–Welch method is an algorithm used to train a Hidden Markov Model (HMM) when the true state (tag) sequence is unknown.

What does the Baum–Welch method do?

It estimates (learns) the transition and emission probabilities of an HMM from unlabeled data.

In Simple Terms

  • You are given only observation sequences (e.g., words)
  • You do not know the hidden state sequence (e.g., POS tags)
  • Baum–Welch automatically learns the model parameters that best explain the data

The Baum–Welch method is used to train an HMM by estimating transition and emission probabilities from unlabeled observation sequences using EM. Baum–Welch is a special case of the Expectation–Maximization (EM) algorithm.
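For a sense of what the re-estimation looks like, here is a compact, unscaled Baum–Welch sketch for a single discrete observation sequence (my own illustrative code: random initialisation and no numerical scaling, so it is only suitable for short toy inputs):

import numpy as np

def baum_welch(obs, n_states, n_symbols, n_iter=20, seed=0):
    """Estimate HMM parameters (pi, A, B) from one unlabeled symbol sequence via EM."""
    rng = np.random.default_rng(seed)
    A = rng.random((n_states, n_states)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((n_states, n_symbols)); B /= B.sum(axis=1, keepdims=True)
    pi = np.full(n_states, 1.0 / n_states)
    obs = np.asarray(obs)
    T = len(obs)

    for _ in range(n_iter):
        # E-step: forward and backward probabilities.
        alpha = np.zeros((T, n_states))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta = np.zeros((T, n_states))
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        likelihood = alpha[-1].sum()

        gamma = alpha * beta / likelihood                 # expected state occupancies
        xi = np.zeros((T - 1, n_states, n_states))        # expected transitions
        for t in range(T - 1):
            xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / likelihood

        # M-step: re-estimate parameters from the expected counts.
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(n_symbols):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
    return pi, A, B

# Toy run: 2 hidden states, 3 observation symbols, one short unlabeled sequence.
pi, A, B = baum_welch([0, 1, 2, 1, 0, 2, 1], n_states=2, n_symbols=3)
print(np.round(A, 3))
print(np.round(B, 3))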

6. If an HMM POS tagger has 50 tags and a 20,000-word vocabulary, the emission matrix size is:






Correct Answer: B

Rows correspond to tags and columns to words.

In an HMM POS tagger, the emission matrix represents:

P(word | tag)

So its dimensions are:

  • Rows = number of tags
  • Columns = vocabulary size

Given:

  • Number of tags = 50
  • Vocabulary size = 20,000

Emission matrix size:

50 × 20,000

7. A trigram HMM POS tagger models:






Correct Answer: B

Trigram models capture dependency on two previous tags.

Trigram Model

A trigram tag model assumes that the probability of a tag depends on the previous two tags.

P(ti | ti−1, ti−2)

In POS tagging using an HMM:

  • Transition probabilities are computed using trigrams of tags
  • The model captures more context than unigram or bigram models

Example:

If the previous two tags are DT and NN, the probability of the next tag VB is:

P(VB | DT, NN)

Note: In practice, smoothing and backoff are used because many trigrams are unseen.

8. Data sparsity in emission probabilities mostly occurs due to:






Correct Answer: B

Unseen words lead to zero emission probabilities without smoothing.

Data sparsity in emission probabilities means that many valid word–tag combinations were never seen during training, so their probabilities are zero or unreliable.

Data sparsity may occur due to one or more of the following:
  • Natural language has a very large vocabulary.
  • Training data is finite.
  • New or rare words often appear during test time.
As a result, many words in the test data were never observed with any tag during training.
9. A common solution for unknown words in HMM POS tagging is:






Correct Answer: B

Smoothing assigns non-zero probabilities to unseen events.


Refer here for more information about Laplace smoothing.
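A small sketch of Laplace (add-one) smoothing applied to an emission estimate (the counts and vocabulary size are invented):

# Illustrative counts from training data.
count_word_given_tag = 0        # the word was never seen with tag NN
count_tag = 300                 # total NN tokens in training data
vocab_size = 20000              # number of distinct words

mle = count_word_given_tag / count_tag                           # unsmoothed MLE -> 0.0
laplace = (count_word_given_tag + 1) / (count_tag + vocab_size)  # small but non-zero

print(mle, laplace)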
10. POS tagging is considered a:






Correct Answer: C

Each token is labeled sequentially → classic sequence labeling.

POS Tagging as a Sequence Labeling Task

POS tagging is a sequence labeling task because the goal is to assign a label (POS tag) to each element in a sequence (words in a sentence) while considering their order and context.


What is Sequence Labeling?

In sequence labeling, we:

  • Take an input sequence: w1, w2, …, wn
  • Produce an output label sequence: t1, t2, …, tn

Each input item receives one corresponding label, and the labels are not independent of each other.


POS Tagging as Sequence Labeling

Input sequence → words in a sentence

The / cat / sleeps

Output sequence → POS tags

DT / NN / VBZ

Each word must receive exactly one POS tag, and the choice of tag depends on:

  • The current word (emission probability)
  • The neighboring tags (context / transition probability)