
Monday, February 9, 2026

Top 10 Syntactic Analysis MCQs in NLP with Answers | Dependency, Parsing & Neural NLP

Top 10 Syntactic Analysis MCQs in NLP (With Detailed Explanations)

Syntactic analysis is a core component of Natural Language Processing (NLP), enabling machines to understand grammatical structure and word relationships. This post presents 10 carefully selected multiple-choice questions (MCQs) on syntactic parsing, dependency structures, neural parsing, and modern NLP concepts. Each question includes a clear explanation to help students prepare for exams, interviews, and competitive tests.

1.
Which parsing strategy is most suitable for handling multiple valid parse trees for a sentence?






Correct Answer: C

Many sentences in natural language are ambiguous, meaning they can have multiple valid parse trees. Structural ambiguity is a common source of this. Consider the example sentence "I saw a man with a telescope", which can be interpreted as follows:

  • I used a telescope to see the man.
  • The man I saw had a telescope.

So the parser must choose the most likely structure, not just any valid one. Probabilistic parsers assign probabilities and choose the most likely structure.

Why probabilistic parsers?

They assign probabilities to grammar rules or parse trees, evaluate all possible parses, and select the most probable one.

Example probabilistic parsers:

  • PCFG (Probabilistic Context-Free Grammar)
  • Neural dependency parsers with scoring

Because ambiguity results in multiple possible parses, we need a ranking mechanism and probabilities provide that.
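
As an illustration, here is a minimal sketch using NLTK's PCFG and ViterbiParser (assuming NLTK is installed); the toy grammar and rule probabilities below are invented for this example, not taken from a real treebank.

import nltk

# Toy grammar: rule probabilities for each left-hand side sum to 1.
grammar = nltk.PCFG.fromstring("""
S -> NP VP [1.0]
NP -> Pronoun [0.4] | Det N [0.4] | NP PP [0.2]
VP -> V NP [0.6] | VP PP [0.4]
PP -> P NP [1.0]
Pronoun -> 'I' [1.0]
Det -> 'a' [1.0]
N -> 'man' [0.5] | 'telescope' [0.5]
V -> 'saw' [1.0]
P -> 'with' [1.0]
""")

parser = nltk.ViterbiParser(grammar)   # keeps only the most probable parse
for tree in parser.parse("I saw a man with a telescope".split()):
    print(tree)
    print("probability:", tree.prob())

Both readings of the sentence are licensed by this grammar; the Viterbi parser scores them and returns only the higher-probability tree, which is exactly the ranking behavior described above.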

2.
What is the main advantage of dependency parsing over constituency parsing for modern NLP systems?






Correct Answer: B

Dependency trees directly model head–dependent relations, making them simpler, efficient, and useful for downstream tasks like translation and information extraction. Dependency parsing is preferred in modern NLP because it gives simple, efficient structures with direct word-to-word relationships.

What is Constituency Parsing?

Constituency parsing divides a sentence into phrases (constituents) based on grammar rules.

Example:

"The boy ate an apple".

Structure:

[ Sentence

[Noun Phrase: The boy]

[Verb Phrase: ate [Noun Phrase: an apple]]

]

The main focus of constituency parsing is on phrase structure (NP, VP, PP, etc.)

What is Dependency Parsing?

Dependency parsing shows direct relationships between words.

Example: ate → root; boy → subject of ate (nsubj); apple → object of ate (obj); The → modifier of boy;

The main focus of dependency parsing is to capture who depends on whom (word-to-word relations).
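
A quick way to see these word-to-word relations is with spaCy (a small sketch, assuming the English model en_core_web_sm is installed; the exact label names such as nsubj or dobj depend on the model's annotation scheme):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The boy ate an apple")

for token in doc:
    # e.g. boy -> nsubj -> ate, apple -> dobj -> ate, ate -> ROOT
    print(f"{token.text:<7} {token.dep_:<7} head = {token.head.text}")

Each line shows a word, its dependency label, and its head, i.e. exactly the head-dependent relations discussed above.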

Main advantage of dependency parsing over constituency parsing for modern NLP systems

Modern NLP tasks (machine translation, information extraction, question answering, etc.) mainly need: Direct word relationships, Simpler structure, Computational efficiency.

Why dependency parsing is better?

  • Fewer nodes (only words, no extra phrase nodes)
  • Simpler trees
  • Direct relations like subject, object, and modifier
  • Faster and easier for machine learning models

Why other options are INCORRECT?

Option A: Captures phrase boundaries more precisely. INCORRECT. That is the strength of constituency parsing, not dependency.

Option C: Requires no training data. INCORRECT. Modern parsers require training.

Option D: Works only for English. INCORRECT. Dependency parsing works for many languages.

3.
A dependency parser must support non-projective parsing when:






Correct Answer: B

Non-projective structures occur when dependency arcs cross. Crossing arcs indicate that the sentence structure cannot be represented using projective constraints. Some languages require parsers that can handle such structures. More can be found below.

What is a dependency parser?

A dependency parser analyzes the grammatical structure of a sentence by identifying head–dependent relationships between words. Each word (except the root) depends on another word (its head). The structure forms a dependency tree. Example: “Ram wrote a letter”. Here, "wrote" is root, "Ram" is subject of wrote, and "letter" is object of wrote.

Dependency parsing focuses on word-to-word relations, which is very useful for modern NLP tasks.

What are Projective and Non-Projective Dependency parsing?

  • Projective dependency: A dependency tree is projective if, when the dependencies are drawn as arcs above the sentence, no arcs cross each other. All relations can be drawn without crossing. Projective structures are common in fixed word-order languages like English.
  • Non-projective dependency: A dependency tree is non-projective if some dependency arcs cross when drawn over the sentence. This often happens due to free word order, long-distance dependencies, and scrambling (common in languages like Hindi, German, Czech, and Tamil). Non-projective parsing is needed to correctly represent such structures.
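
The crossing-arc condition can be checked mechanically. The sketch below (plain Python, purely illustrative) takes the head index of each token, with 0 marking the root, and tests whether any two arcs cross:

def is_projective(heads):
    # heads[i-1] is the head of token i (1-based); 0 marks the root
    arcs = [(min(i, h), max(i, h)) for i, h in enumerate(heads, start=1) if h != 0]
    for (l1, r1) in arcs:
        for (l2, r2) in arcs:
            if l1 < l2 < r1 < r2:   # one arc starts inside the other and ends outside it
                return False
    return True

# "Ram wrote a letter": Ram->wrote, wrote = root, a->letter, letter->wrote
print(is_projective([2, 0, 4, 2]))   # True  (no crossing arcs)
print(is_projective([3, 4, 0, 3]))   # False (arcs 1-3 and 2-4 cross)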

Why other options are INCORRECT?

Sentence length (Option A) does not cause non-projectivity.

Unknown words (Option C) relate to vocabulary issues, not structure.

Grammar ambiguity (Option D) affects interpretation but does not necessarily create crossing dependencies.

4.
What is a key limitation of greedy transition-based parsers?






Correct Answer: C

Transition-based parsers make local decisions. Early mistakes cannot be corrected later (The parser cannot go back and fix the mistake), leading to error propagation.

What are Greedy Transition-Based Parsers?

Transition-based dependency parsers build a dependency tree step-by-step using a sequence of actions (such as SHIFT, LEFT-ARC, RIGHT-ARC, and REDUCE). A greedy parser chooses the best action at each step based only on current information and does not reconsider previous decisions. Greedy parsers are very fast and memory-efficient.
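
The sketch below illustrates the greedy, one-pass nature of such a parser. For clarity the action sequence is hard-coded rather than predicted by a classifier; in a real parser those predictions are where an early, unrecoverable mistake can happen.

def parse(words, actions):
    # Simplified arc-standard transitions over a stack and a buffer of word indices.
    stack, buffer, arcs = [], list(range(len(words))), []
    for act in actions:
        if act == "SHIFT":
            stack.append(buffer.pop(0))
        elif act == "LEFT-ARC":            # second-from-top depends on the top
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif act == "RIGHT-ARC":           # top depends on the second-from-top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return [(words[h], words[d]) for h, d in arcs]

words = ["Ram", "wrote", "a", "letter"]
actions = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "SHIFT", "LEFT-ARC", "RIGHT-ARC"]
print(parse(words, actions))
# [('wrote', 'Ram'), ('letter', 'a'), ('wrote', 'letter')]

Each action is applied once and never revisited, which is exactly why a wrong early action propagates into later errors.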

Why other options are INCORRECT?

High memory usage (Option A) Greedy parsers use low memory.

Cannot handle short sentences (Option B) Works for any sentence length.

Cannot produce dependency trees (Option D) Produces trees efficiently.

5.
Graph-based dependency parsing differs from transition-based parsing because it:






Correct Answer: B

Graph-based parsers evaluate possible trees globally and select the highest-scoring structure, reducing greedy errors.

Transition-Based vs Graph-Based dependency parsing

  • Transition-Based Parsing: Builds the dependency tree step-by-step. Uses a stack, buffer, and actions (SHIFT, LEFT-ARC, RIGHT-ARC). Decisions are local and incremental.
  • Graph-Based Parsing: Treats parsing as a global optimization problem. Considers all possible head–dependent arcs. Assigns a score to the entire tree. Selects the highest-scoring valid tree (often using algorithms like MST – Maximum Spanning Tree)

Why other options are INCORRECT?

Builds the tree incrementally using a stack (Option A) This describes transition-based parsing, not graph-based.

Uses no machine learning (Option C) Modern graph-based parsers heavily use machine learning (neural networks).

Works only for projective trees (Option D) Many graph-based methods can handle non-projective trees (e.g., MST parser).

6.
In modern neural parsers, contextual embeddings like BERT help because they:






Correct Answer: C

Contextual embeddings capture agreement, phrase boundaries, and long-distance dependencies, improving parsing accuracy. BERT helps parsers by understanding each word based on its context, which improves detection of grammatical relationships.

Why embeddings are important in modern neural parsers?

Neural parsers (dependency or constituency) work with numbers, not words. So each word must be converted into a vector representation — this is called an embedding.

Embeddings are essential in modern neural parsers because they:

  • Convert words into numerical input
  • Capture semantic and syntactic information
  • Provide contextual understanding (with BERT)
  • Significantly improve parsing performance.
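
A small sketch with the Hugging Face transformers library (assuming bert-base-uncased can be downloaded) shows the key property: the same word gets different vectors in different contexts.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    idx = inputs.input_ids[0].tolist().index(tokenizer.convert_tokens_to_ids(word))
    return hidden[idx]

v1 = embedding_of("the bank approved the loan", "bank")
v2 = embedding_of("they sat on the river bank", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # clearly below 1: context changes the vector

A parser fed such context-sensitive vectors can distinguish uses of a word that a static embedding would collapse into one point.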
7.
In dependency parsing, the head selection problem refers to:






Correct Answer: C

Dependency parsing determines which word acts as the head and which is the dependent for each relationship.

What is the head selection problem?

In dependency parsing, for each word, the parser must decide "Which word is its head?". This decision is called head selection. For example, given the sentence "She saw a dog", for the word "dog" the parser must decide whether "dog" depends on "saw" or on "a". So the head selection task is about choosing the governing word for each word.

Head selection = deciding which word is the head (governor) for each word in the sentence.

8.
Why do traditional syntactic parsers struggle with very long sentences?






Correct Answer: B

They struggle because the number of possible parses and computational cost grow rapidly with sentence length, making long-distance dependencies hard to handle.

Traditional parsers struggle with long sentences because:

  • Too many possible structures
  • High computational cost
  • Difficulty handling long-distance relationships
  • Error propagation
9.
What is the main purpose of the Universal Dependencies (UD) framework?






Correct Answer: B

The Universal Dependencies (UD) framework is designed to create a consistent and language-independent way to represent grammatical structure (syntactic annotation) across many languages.

Different languages have different grammar. UD provides a common set of rules and labels so that dependency structures look similar across languages and the same annotation scheme is used worldwide.

10.
How does syntactic information help large language models during training?






Correct Answer: B

Even without explicit parsing, LLMs learn syntax implicitly, helping capture agreement, clause structure, and long-distance dependencies.

More explanation:

Large Language Models (LLMs) need to understand how words in a sentence are related to each other. In many sentences, important grammatical relationships occur between words that are far apart.

Example:

The book that the student bought yesterday is interesting.

The verb "is" agrees with "book", not with "student" or "yesterday". This is called a long-range dependency.

Syntactic signals (such as dependency relations or structural patterns) help the model:

  • Identify subject–verb and modifier relationships
  • Understand sentence structure
  • Maintain grammatical consistency
  • Handle complex and long sentences

Without syntactic information, the model may rely only on nearby words and miss these long-distance relationships.

Tuesday, February 3, 2026

Transformer & LLM Architecture MCQs Explained | Deep Learning NLP


Conceptual MCQs on Transformer and Large Language Model Architectures

These conceptual multiple-choice questions (MCQs) cover core ideas behind Transformer architectures and large language models (LLMs), including BERT, GPT-style models, self-attention, positional embeddings, and in-context learning. Each question is followed by a clear, exam- and interview-oriented explanation.

These questions are useful for machine learning students, NLP researchers, and anyone preparing for deep learning exams or technical interviews.

Topics Covered

  • Self-attention vs RNNs
  • BERT masking strategy
  • GPT scaling behavior
  • Positional embeddings
  • In-context learning

1.
A shallow Transformer often outperforms a much deeper RNN on long documents. What is the primary architectural reason?






Correct Answer: B

This question actually tests: "Why can a shallow Transformer understand long documents better than a very deep RNN?"

The key difference lies in how information flows across long distances in a sequence. Transformers handle long-range dependencies efficiently because self-attention allows any token to directly attend to any other token, unlike RNNs where information must propagate sequentially.

What happens in an RNN?

In an RNN (even LSTM/GRU): Information from an early token must pass step by step through every intermediate token. For a document of length n, the dependency path length is O(n).

So for long documents, important early information gets weakened or distorted, and learning long-range dependencies becomes very hard. Even though depth helps, it cannot remove this sequential bottleneck.

What happens in a Transformer?

With self-attention every token can directly attend to any other token. Dependency path length is O(1) (one attention step). So even with few layers, a word at the start can influence a word at the end immediately. Long-range relationships (coreference, topic continuity, constraints) are preserved.

This is why context window length matters more than depth for long documents.

In simpler terms,

Self-attention's ability to create direct connections between any two tokens in the sequence, regardless of their distance, fundamentally solves the long-range dependency problem that plagues RNNs. In a single attention operation, token i can directly interact with token j, establishing a maximum path length of O(1) between them. This architectural property means that:

  • Information flows directly
  • Gradients propagate effectively
  • Long-range dependencies become learnable
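
A toy attention computation (PyTorch, random weights, purely illustrative) makes the O(1) path length visible: the attention matrix connects every pair of tokens in a single step.

import math
import torch

torch.manual_seed(0)
n, d = 6, 8                                  # 6 tokens, embedding size 8 (arbitrary)
x = torch.randn(n, d)                        # stand-in token embeddings
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

Q, K, V = x @ Wq, x @ Wk, x @ Wv
weights = torch.softmax(Q @ K.T / math.sqrt(d), dim=-1)   # n x n attention matrix
out = weights @ V

print(weights[0, -1])   # direct, non-zero weight from the first token to the last token

In an RNN, information from the first token would have to pass through every intermediate step to reach the last one; here a single matrix entry links them.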

Why other options are INCORRECT?

Option A: INCORRECT. Embeddings are not the main reason — RNNs can also use high-quality embeddings.

Option C: INCORRECT. This applies mainly to BERT, not all Transformers. Also, bidirectionality alone does not solve long dependency paths.

Option D: INCORRECT. Parallelization affects training speed, not the model’s ability to understand long documents.

2.
Why does BERT mask only about 15% of tokens during masked language modeling?






Correct Answer: C

BERT masks only about 15% of tokens to ensure that most inputs during pretraining resemble real text, thereby reducing the distribution mismatch between pretraining and fine-tuning.


Explanation:

What is meant by distribution shift between pretraining and fine-tuning?

Distribution shift between pretraining and fine-tuning means the kind of inputs the model sees during pretraining are different from what it sees later when we actually use it.

Alternate definition: Distribution shift is the mismatch between the input patterns seen during model training and those encountered during fine-tuning or inference.

Example: During BERT pretraining, inputs contain [MASK] tokens, e.g. "The capital of France is [MASK]". During fine-tuning and real use, the same kind of input appears without [MASK], e.g. "The capital of France is Paris."

Why does the masking of tokens help in reducing the distribution shift?

By including unmasked tokens and random tokens mixed in with [MASK] tokens during pretraining, the model builds robustness to inputs without the special masking token. When fine-tuning arrives and the [MASK] token suddenly disappears, the model has already learned patterns that apply to token-level inputs that aren't artificially masked. This softens the domain gap and improves transfer learning performance.
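
The original BERT recipe replaces 80% of the selected tokens with [MASK], 10% with a random token, and leaves 10% unchanged. Below is a simplified sketch of that recipe in plain Python; a real implementation works on wordpiece IDs and skips special tokens, but the 15% selection and the 80/10/10 split are the same idea.

import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    out, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            targets.append(tok)                   # model must predict this token
            r = random.random()
            if r < 0.8:
                out.append("[MASK]")              # 80%: replace with [MASK]
            elif r < 0.9:
                out.append(random.choice(vocab))  # 10%: replace with a random token
            else:
                out.append(tok)                   # 10%: keep the original token
        else:
            targets.append(None)                  # not selected, no prediction
            out.append(tok)
    return out, targets

print(mask_tokens("the capital of france is paris".split(),
                  vocab=["dog", "paris", "blue"]))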

Why does masking only 15% of tokens help?

Masking fewer tokens means: 85% of tokens remain normal. Sentence structure is mostly realistic. [MASK] tokens are rare, not dominant. So the training distribution stays close to the fine-tuning distribution.

Why other options are INCORRECT?

Option A: INCORRECT. Masking more or fewer tokens doesn’t meaningfully change training time.

Option B: INCORRECT. Actually, masking more tokens would make copying harder. And, this doesn’t explain why only 15%.

Option D: INCORRECT. Hardness is not the goal; representation quality and transferability is.

3.
Why do GPT-style decoder-only models scale better for text generation than encoder–decoder models?






Correct Answer: C

Encoder–decoder models are great for input-to-output transformation (e.g., translation, summarization), whereas GPT-style decoder-only models are ideal for pure text generation. As scale increases, the combination of architectural simplicity and training–inference alignment wins. That’s why large-scale language generation today is dominated by GPT-style architectures.

GPT-style models are decoder-only and autoregressive.

Being decoder-only and autoregressive means they are trained to do one simple thing: given the previous tokens, predict the next token. This is exactly what they do at inference time when generating text.

  • During training: Predict next token using only past tokens (causal / left-to-right attention).
  • During inference: Predict next token using only past tokens.

No mismatch between training and generation. This perfect alignment becomes more important as models scale to billions of parameters and massive datasets, which is why GPT-style models scale so well.

Why other options are INCORRECT?

Option A: They require fewer parameters. INCORRECT. GPT models often have more parameters than encoder–decoder models like T5; scaling success is not about being smaller.

Option B: They avoid bidirectional attention. INCORRECT. Avoiding bidirectional attention is a constraint, not the reason for better scaling. Bidirectional attention is powerful for understanding tasks, but it doesn’t help generation.

Option D: They do not use positional embeddings. INCORRECT. They absolutely do use positional information (absolute or rotary positional embeddings). So this option is factually incorrect.

4.
What happens if positional embeddings are completely removed from a Transformer model?






Correct Answer: B

Removing positional embeddings makes a Transformer permutation-invariant, so it cannot model word order.

What are positional embeddings?

Positional embeddings are vectors added to token embeddings to encode the order of tokens in a sequence. Transformers process all tokens in parallel, so unlike RNNs or LSTMs, they have no built-in sense of order. Positional embeddings fix this.

Why they are needed (simple intuition)

Take these two sentences: “dog bites man” and “man bites dog”. They have the same words, but different meanings.

  • Without positional embeddings: a Transformer sees both as the same bag of words.
  • With positional embeddings: “dog” at position 1 ≠ “dog” at position 3, so order becomes meaningful.

What happens if completely removed?

A Transformer’s self-attention mechanism, by itself, does not know word order.

  • Self-attention only looks at token embeddings and similarities between tokens.
  • It treats the input as a set, not a sequence.

If you completely remove positional embeddings: The model cannot tell whether the input is "dog bites man" or "man bites dog". Any permutation of tokens produces the same attention pattern. So the model becomes permutation-invariant (order doesn’t matter).
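
This can be demonstrated directly. In the toy sketch below (PyTorch, random weights, purely illustrative), shuffling the input rows of a position-free self-attention layer simply shuffles the output rows, so nothing about the original order survives.

import math
import torch

torch.manual_seed(0)
n, d = 4, 8
x = torch.randn(n, d)                      # stand-in token embeddings, no positions added
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def attend(t):
    Q, K, V = t @ Wq, t @ Wk, t @ Wv
    return torch.softmax(Q @ K.T / math.sqrt(d), dim=-1) @ V

perm = torch.tensor([2, 0, 3, 1])          # shuffle the "words"
# Permuting the input just permutes the output: the layer cannot tell the orders apart.
print(torch.allclose(attend(x)[perm], attend(x[perm]), atol=1e-5))  # True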

5.
Why do large language models often outperform BERT on commonsense reasoning tasks?






Correct Answer: C

Autoregressive LLMs outperform BERT on commonsense reasoning because they learn world knowledge and multi-step reasoning by continuously predicting future tokens.

What does commonsense reasoning require?

Commonsense reasoning often requires:

  • Temporal flow (e.g., “what happens next?”)
  • Causal reasoning (e.g., “if this happened, then what follows?”)
  • Multi-step inference across several ideas

How Large Language Models Acquire Commonsense Knowledge?

Large language models (such as GPT-style models) are typically autoregressive. This means they are trained to repeatedly answer a single core question:

“Given everything so far, what comes next?”

To perform this task accurately, the model must learn patterns that go far beyond individual words. Over time, it acquires:

  • How events usually unfold over time
  • Cause–effect relationships in the real world
  • Everyday facts and common situations
  • Multi-step reasoning and inference patterns

Across billions of prediction steps, the model implicitly accumulates commonsense knowledge and learns how to chain ideas together, which is essential for reasoning tasks.

Why BERT Struggles More with Commonsense Reasoning?

BERT follows a fundamentally different training strategy:

  • It is trained using masked language modeling, where random words are hidden and must be predicted.
  • It uses bidirectional context, but focuses on local sentence-level understanding.
  • It excels at language understanding tasks such as classification, named entity recognition (NER), and semantic similarity.
  • However, it is not trained to generate long sequences or reasoning chains.

Since BERT was not optimized for sequential prediction and reasoning, it typically underperforms autoregressive models on commonsense reasoning tasks.

6.
Which limitation of BERT makes it less suitable for tasks requiring multi-step reasoning?






Correct Answer: C

BERT is trained with masked language modeling for sentence-level understanding, not for sequential generation, making it weaker at multi-step reasoning tasks.

More explanation:


What is multi-step reasoning?

Multi-step reasoning is the ability to arrive at an answer by going through a sequence of intermediate logical steps, where each step depends on the previous one. Instead of jumping straight to the answer, the model (or person) has to chain several inferences together.

LLM Example: Multi-Step Reasoning

Question:

If all neural networks are models, and transformers are neural networks, what are transformers?

Reasoning Steps:

  1. Neural networks ⊆ models
  2. Transformers ⊆ neural networks
  3. Therefore, transformers ⊆ models

Each step builds on the previous one, illustrating how multi-step reasoning combines intermediate inferences to reach a final conclusion.

Why multi-step reasoning is crucial for language models?

Multi-step reasoning is crucial for:

  • commonsense reasoning
  • logical inference
  • causal and temporal questions etc.

Autoregressive LLMs are better at this because they generate text step by step, naturally mirroring the reasoning process.

7.
The primary benefit of using multiple attention heads in Transformers is that they:






Correct Answer: B

Multiple attention heads allow a Transformer to attend to different types of relationships between tokens simultaneously.

What multiple attention heads actually do?

In a Transformer, attention decides which other tokens a word should focus on. When we use multiple attention heads, we don’t just repeat the same attention—we let the model look at the sentence in different ways at the same time.

Multi-head attention splits the query, key, and value projections into multiple "heads," each operating in parallel on lower-dimensional subspaces of the input embeddings. This allows each head to specialize in distinct relationships—like syntactic dependencies in one head, semantic patterns in another, or positional cues in yet another—before concatenating and linearly transforming the outputs. A single head would force all relationships into one averaged attention pattern, creating an information bottleneck and limiting expressiveness.
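
A compact sketch of the mechanics (PyTorch, random weights; the dimensions are arbitrary): the model dimension is split across heads, each head computes its own attention pattern in a smaller subspace, and the head outputs are concatenated and projected.

import math
import torch

torch.manual_seed(0)
n, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads
x = torch.randn(n, d_model)
Wq, Wk, Wv, Wo = (torch.randn(d_model, d_model) for _ in range(4))

def split(t):                       # (n, d_model) -> (n_heads, n, d_head)
    return t.view(n, n_heads, d_head).transpose(0, 1)

Q, K, V = split(x @ Wq), split(x @ Wk), split(x @ Wv)
weights = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(d_head), dim=-1)
heads = weights @ V                                   # one attention pattern per head
out = heads.transpose(0, 1).reshape(n, d_model) @ Wo  # concatenate heads, then project
print(weights.shape)   # torch.Size([4, 5, 5]): four separate attention patterns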

8.
Why do LLMs with fixed context windows struggle with very long documents?






Correct Answer: B

LLMs with fixed context windows struggle with long documents because self-attention scales quadratically in compute and memory with sequence length.

What does “fixed context window” mean?

LLMs process text in chunks called context windows (for example, 2k, 8k, 32k tokens). Inside one window, every token attends to every other token using self-attention. That’s powerful — but expensive.

Why self-attention becomes a problem for long documents?

In self-attention, each of the n tokens compares itself with all n tokens. This creates an n × n attention matrix. So compute cost grows as O(n²) and memory usage also grows as O(n²).

As the document gets longer:

  • GPU memory fills up quickly.
  • Computation becomes slow or infeasible.
  • The model must truncate, slide windows, or summarize instead of reading everything.

This is the core reason long documents are hard.
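
A back-of-the-envelope calculation shows how fast this grows (illustrative numbers only: one fp16 attention matrix per head per layer, ignoring activations and KV caches):

# n x n attention matrix, 2 bytes per entry in fp16
for n in (2_000, 8_000, 32_000, 128_000):
    gb = n * n * 2 / 1e9
    print(f"context {n:>7} tokens -> {gb:8.2f} GB per head per layer")

Quadrupling the context length multiplies this memory by sixteen, which is why long documents quickly become infeasible for a fixed window.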

9.
Why can a large language model adapt to a new task using only a few examples provided in the prompt?






Correct Answer: C

Large language models (LLMs) adapt to new tasks through in-context learning, where few-shot examples in the prompt act as contextual signals that guide the model's predictions without altering its fixed parameters. The attention mechanisms process these examples alongside the input, enabling pattern recognition and task generalization during inference.

What’s Actually Happening? / What is in-context learning?

When you give a large language model (LLM) a prompt like:

"Translate English to French:dog → chien; cat → chat; house → "

The model is not learning in the usual machine learning sense. No parameters or weights inside the model are being updated.

Instead, the examples in the prompt act like temporary instructions that influence the model’s next prediction.

During pretraining, the model has learned that:

  • Patterns in the recent context matter
  • Earlier input–output pairs often define a task

As a result, the model treats the examples as conditioning signals:

  • “The task here is translation”
  • “The mapping pattern is English → French”
  • “I should continue this pattern”

This phenomenon is known as in-context learning.
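
A minimal example of what such a conditioning prompt looks like (no model is called here; the expected continuation is noted in a comment):

prompt = (
    "Translate English to French:\n"
    "dog -> chien\n"
    "cat -> chat\n"
    "house ->"
)
# Sent to a pretrained LLM, this prompt is typically continued with " maison":
# the examples define the task at inference time, with no weight updates.
print(prompt)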

10.
Which property makes Transformers theoretically capable of approximating any sequence-to-sequence function?






Correct Answer: C

Transformers are universal sequence-to-sequence approximators because self-attention enables global interaction and feed-forward networks provide nonlinear expressiveness.

Explanation:

Self-attention computes contextual mappings, allowing each position to weigh relationships across the entire input sequence dynamically. In simpler terms, self-attention decides what information to gather and from where.

Feed-forward networks (position-wise MLPs) provide non-linear value transformations, enabling approximation of arbitrary functions when stacked with attention. In simpler terms, feed-forward networks decide how to transform that information.

Together, they can represent any mapping from an input sequence to an output sequence, in theory, assuming sufficient depth, width, and data.

Why other options are INCORRECT?

Option A: Layer normalization. INCORRECT. Helps stabilize and speed up training, but does not increase representational capacity.

Option B: Residual connections. INCORRECT. Improve gradient flow and optimization; they don’t make the model more expressive.

Option D: Tokenization strategy. INCORRECT. A preprocessing choice, not a source of theoretical function-approximation power.

Sunday, January 25, 2026

Shallow Parsing in NLP – Top 10 MCQs with Answers (Chunking)


Introduction

Shallow parsing, also known as chunking, is a foundational technique in Natural Language Processing (NLP) that focuses on identifying flat, non-recursive phrase structures such as noun phrases, verb phrases, and prepositional phrases from POS-tagged text. Unlike deep parsing, which attempts to build complete syntactic trees, shallow parsing prioritizes efficiency, robustness, and scalability, making it a preferred choice in large-scale NLP pipelines.

This MCQ set is designed to test both conceptual understanding and implementation-level knowledge of shallow parsing. The questions cover key aspects including design philosophy, chunk properties, finite-state models (FSA and FST), BIO tagging schemes, and statistical sequence labeling approaches such as Conditional Random Fields (CRFs). These questions are particularly useful for students studying NLP, Computational Linguistics, Information Retrieval, and AI, as well as for exam preparation and interview revision.

Try to reason through each question before revealing the answer to strengthen your understanding of how shallow parsing operates in theory and practice.


1.
Which statement best captures the primary design philosophy of shallow parsing?






Correct Answer: C

Shallow parsing trades depth and linguistic completeness for efficiency and robustness.

Shallow parsing (chunking) is designed to identify basic phrases like noun phrases (NP) and verb phrases (VP), to avoid recursion and nesting, and to keep the analysis fast, simple, and robust.

Because of this design choice, shallow parsing scales well to large corpora, works better with noisy or imperfect POS tagging, and is practical for real-world NLP pipelines (IR, IE, preprocessing).

2.
Why is shallow parsing preferred over deep parsing in large-scale NLP pipelines?






Correct Answer: C

Shallow parsing is preferred over deep parsing because it is computationally faster and more robust to noise while providing sufficient structural information for many NLP tasks.

Shallow parsing is preferred over deep parsing mainly because it is faster, simpler, and more robust, especially in real-world NLP systems. Following are the reasons;

  • Computational efficiency: Shallow parsing works with local patterns over POS tags. It avoids building full syntactic trees. Much faster and uses less memory than deep parsing
  • Robustness to noisy data: Shallow parsing tolerates errors because it matches short, local tag sequences
  • Scalability: Suitable for large-scale text processing
  • Lower resource requirements: Shallow parsing can be implemented using Finite-state automata, regular expressions, and sequence labeling models (e.g., CRFs)

For more information, visit

Shallow parsing (chunking) VS Deep parsing

3.
The phrase patterns used in shallow parsing are most appropriately modeled as:






Correct Answer: B

Phrase patterns in shallow parsing are best modeled as regular expressions / regular languages because chunking is local, linear, non-recursive, and non-overlapping. All of these properties fit exactly within the expressive power of regular languages.

Why the phrase patterns used in shallow parsing are modeled as regular expressions/regular languages?

1. Shallow parsing works on POS tag sequences, not full syntax. In chunking, we usually operate on sequences like "DT JJ JJ NN VBZ DT NN" and define patterns such as "NP → DT? JJ* NN+". This is pattern matching over a flat sequence, not hierarchical structure building. That is exactly what regular expressions are designed for.

2. Chunk patterns are non-recursive. Regular languages cannot express recursion. Shallow parsing intentionally avoids recursion (No nested constituents). For example, "[NP the [NP quick brown fox]]" is not allowed in shallow parsing.

3. Chunks are non-overlapping. Each word belongs to at most one chunk. Example: "[NP the dog] [VP chased] [NP the cat]". There is no crossing or embedding like: "*[NP the dog chased] [NP the cat]". This strict linear segmentation matches the finite-state assumption. Since recursion is forbidden by design, CFG power is unnecessary.
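
The NP → DT? JJ* NN+ pattern maps directly onto NLTK's RegexpParser (a sketch assuming NLTK is installed; the POS tags here are supplied by hand rather than by a tagger):

import nltk

grammar = "NP: {<DT>?<JJ>*<NN>+}"          # regular expression over POS tags
chunker = nltk.RegexpParser(grammar)

tagged = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("dog", "NN")]
print(chunker.parse(tagged))   # groups "the quick brown fox" and "the dog" as NP chunks

The chunk grammar is compiled into a finite-state matcher over the tag sequence, which is exactly the regular-language behavior described above.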

4.
Which automaton is suitable for recognizing chunk patterns in rule-based shallow parsing over POS-tagged text?






Correct Answer: B

Why Deterministic finite state automaton (FSA) is suitable for recognizing chunk patterns in rule-based shallow parsing over POS-tagged text?

Chunk patterns in shallow parsing are regular and flat, so they can be efficiently recognized using a finite state automaton.

In rule-based shallow parsing (chunking), the goal is to recognize flat phrase patterns (such as noun phrases or verb phrases) in a linear sequence of POS tags, for example "DT JJ NN VBZ DT NN".

Chunk patterns are defined using regular expressions like "NP → DT? JJ* NN+".

Such patterns belong to the class of regular languages, which can be recognized by a finite state automaton (FSA). Therefore, a deterministic finite state automaton (FSA) is suitable for recognizing chunk patterns in rule-based shallow parsing. More powerful automata like pushdown automata or Turing machines are unnecessary because shallow parsing does not require recursion or unbounded memory.

5.
Why are finite-state transducers (FSTs) sometimes preferred over FSAs in shallow parsing?






Correct Answer: B

Finite-state transducers (FSTs) are sometimes preferred over finite-state automata (FSAs) in shallow parsing because they can both recognize patterns and produce output labels, whereas FSAs can only recognize whether a pattern matches.

In shallow parsing, the task is not just to detect that a sequence of POS tags forms a chunk, but also to label the chunk boundaries, such as assigning NP, VP, or BIO tags (B-NP, I-NP, O). An FST maps an input POS-tag sequence to an output sequence with chunk labels or brackets, making it well suited for this purpose.

Since shallow parsing involves flat, non-recursive, and local patterns, the power of finite-state models is sufficient. Using an FST adds practical usefulness by enabling annotation and transformation, while retaining the efficiency and simplicity of finite-state processing.

6.
In the BIO chunk tagging scheme, the tag B-NP indicates:






Correct Answer: B

BIO chunk tagging scheme in shallow parsing - short notes

The BIO chunk tagging scheme is a commonly used method in shallow parsing (chunking) to label phrase boundaries in a sequence of tokens.

BIO stands for:

  • B (Begin) – marks the first word of a chunk
  • I (Inside) – marks words inside the same chunk
  • O (Outside) – marks words that are not part of any chunk

Each B and I tag is usually combined with a chunk type, such as NP (noun phrase) or VP (verb phrase).

Example:

The   quick  brown  fox   jumps
B-NP  I-NP   I-NP   I-NP  B-VP

The BIO tagging scheme represents flat, non-overlapping chunks, avoids hierarchical or nested structures, and converts chunking into a sequence labeling problem. Due to its simplicity and clarity, it is widely used in rule-based, statistical, and neural-network-based shallow parsing systems.
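
Because BIO turns chunking into sequence labeling, recovering the chunks from a tag sequence is a simple linear scan, as in this illustrative sketch:

def bio_to_chunks(tokens, tags):
    # Collect (chunk_type, words) spans from parallel token and BIO tag lists.
    chunks, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current: chunks.append(current)
            current = (tag[2:], [tok])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)
        else:                                   # "O" or an inconsistent I- tag
            if current: chunks.append(current)
            current = None
    if current: chunks.append(current)
    return [(label, " ".join(words)) for label, words in chunks]

print(bio_to_chunks(["The", "quick", "brown", "fox", "jumps"],
                    ["B-NP", "I-NP", "I-NP", "I-NP", "B-VP"]))
# [('NP', 'The quick brown fox'), ('VP', 'jumps')]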

7.
Which property must hold for chunks produced by shallow parsing?






8.
When shallow parsing is formulated as a sequence labeling problem, which probabilistic model is commonly used?






Correct Answer: C

What is Conditional Random Field (CRF)?

A CRF (Conditional Random Field) is a probabilistic, discriminative model used for sequence labeling tasks in machine learning and natural language processing.

A Conditional Random Field models the probability of a label sequence given an input sequence, i.e., P(Y | X), where X is the observation sequence and Y is the corresponding label sequence.

What CRFs are used for?

CRFs are commonly used in NLP tasks such as Shallow parsing (chunking), Named Entity Recognition (NER), Part-of-Speech tagging, Information extraction.

Why CRF is used for shallow parsing?

Conditional Random Fields (CRFs) are used for shallow parsing because shallow parsing is naturally a sequence labeling problem, and CRFs are designed to model dependencies between neighboring labels in a sequence.
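
A toy sketch using the sklearn-crfsuite package (an assumption: the package is installed; the two hand-labeled training sentences and the POS-only features are invented purely for illustration):

import sklearn_crfsuite

def features(tags, i):
    # Features for token i: its own POS tag and the previous POS tag.
    return {"pos": tags[i], "prev_pos": tags[i - 1] if i > 0 else "BOS"}

sentences = [["DT", "JJ", "NN", "VBZ"], ["DT", "NN", "VBD", "DT", "NN"]]
labels    = [["B-NP", "I-NP", "I-NP", "B-VP"], ["B-NP", "I-NP", "B-VP", "B-NP", "I-NP"]]

X = [[features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)

test = ["DT", "JJ", "NN"]
print(crf.predict([[features(test, i) for i in range(len(test))]]))

The CRF jointly scores whole BIO tag sequences, so dependencies between neighboring labels (e.g., I-NP must follow B-NP or I-NP) are learned rather than hand-coded.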

9.
Shallow parsing is less sensitive to POS tagging errors than deep parsing because:






Correct Answer: C

Shallow parsing is less sensitive to POS tagging errors because it relies on local patterns and partial structure, not a full grammatical tree. So a small POS mistake usually affects only one chunk, not the whole analysis.

Deep parsing, on the other hand, tries to build a complete syntactic tree, where one wrong POS tag can break the entire parse.

Why shallow parsing is less sensitive to POS tagging errors?

Shallow parsing (chunking) groups words into flat chunks like NP (noun phrase), VP (verb phrase), etc. It uses local POS patterns.

If one POS tag is wrong, the damage is local, the chunk may still be mostly correct, and the neighboring chunks remain unaffected. Error does not propagate much.

Why deep parsing is more sensitive to POS tagging errors?

Deep parsing (full syntactic parsing) builds a hierarchical parse tree with dependencies between words. POS tags determine the Phrase boundaries, Head–dependent relations, and Overall sentence structure.

If a POS tag is wrong the parser may choose the wrong grammar rule, fail to build a valid tree, and may produce a completely incorrect parse. Error propagates through the entire tree.

Example:

In the sentence "The can rusts quickly", if the word "can" is wrongly tagged as a VERB instead of a NOUN,

  • Shallow parsing: Might still form a rough NP or VP and the error affects only one chunk.
  • Deep parsing: Subject–verb structure breaks and the whole sentence tree becomes invalid or wrong.
10.
Which of the following tasks lies just beyond the scope of shallow parsing?






Correct Answer: C

Shallow parsing cannot resolve subject–object dependencies. Determining which phrase is the subject and which is the object requires syntactic relations across phrases, which goes beyond flat chunking.

In simpler terms, shallow parsing identifies flat phrase boundaries such as NP, VP, and PP, but does not determine grammatical relations like subject–object dependencies, which require deep syntactic analysis.
