In Natural Language Processing (NLP), parsing refers to the process of analyzing the grammatical
structure of a sentence. Based on the depth of syntactic analysis required, parsing techniques are broadly
classified into shallow parsing and deep parsing.
Shallow parsing (chunking) focuses on identifying flat, non-recursive phrase units such as
noun phrases (NP), verb phrases (VP), and prepositional phrases (PP). It avoids building complete parse trees,
making it computationally efficient, robust to noise, and suitable for large-scale NLP applications.
In contrast, deep parsing (full parsing) aims to construct a complete syntactic structure of a
sentence by capturing hierarchical relationships and long-distance dependencies. Although linguistically richer,
deep parsing is more computationally expensive and sensitive to errors.
The visuals and comparison table below clearly illustrate the conceptual and practical differences between
shallow parsing and deep parsing.
Comparison of Shallow Parsing (Chunking) and Deep Parsing (Full Parsing) in NLP

Aspect | Shallow Parsing (Chunking) | Deep Parsing (Full Parsing)
Main Goal | Identify simple (flat) phrase chunks | Build complete syntactic structure
Output | Flat, non-recursive phrases (NP, VP, PP) | Hierarchical parse trees
Recursion | Not allowed | Allowed
Phrase Overlap | No overlapping chunks | Overlapping via tree structure
Levels of Analysis | Partial / surface level | Complete syntactic analysis
Grammar Used | Regular expressions, FSA, FST | CFGs, dependency grammars
Speed | Fast & efficient | Slower
Computational Cost | Low | High
Error Tolerance | Robust to POS errors | Sensitive to errors
Scalability | Suitable for large datasets | Not scalable for large corpora
Learning Models | Rule-based, CRFs, HMMs | Probabilistic & neural parsers
Training Data | Minimal or optional | Large annotated treebanks
Long-distance Dependencies | Not handled | Handled
Semantic Understanding | Limited | Strong
Typical Use Cases | Information extraction, preprocessing | Machine translation, grammar checking
Example Task | Base noun phrase detection | Subject–object dependency analysis
Examples | [NP The quick brown fox] [VP jumps] [PP over] [NP the lazy dog] | Full constituency or dependency parse of the sentence
Shallow Parsing (Chunking) Example
Consider the sentence:
"The quick brown fox jumps over the lazy dog."
Shallow parsing segments it into flat, non-overlapping chunks:
[NP The quick brown fox] [VP jumps] [PP over] [NP the lazy dog]
Deep Parsing (Full Parsing) Example
Using the same sentence, deep parsing performs a complete syntactic analysis by generating a full constituency parse tree or a dependency parse. Unlike shallow parsing, it reveals hierarchical structure and nesting.
Constituency Parse Tree
This parse tree reveals nested phrase structure:
The noun phrase NP ("The quick brown fox") functions as the subject.
The verb phrase VP contains a prepositional phrase PP
("over the lazy dog") acting as an object modifier.
Introduction
Shallow parsing, also known as chunking, is a foundational technique in Natural Language Processing (NLP) that focuses on identifying flat, non-recursive phrase structures such as noun phrases, verb phrases, and prepositional phrases from POS-tagged text. Unlike deep parsing, which attempts to build complete syntactic trees, shallow parsing prioritizes efficiency, robustness, and scalability, making it a preferred choice in large-scale NLP pipelines.
This MCQ set is designed to test both conceptual understanding and implementation-level knowledge of shallow parsing. The questions cover key aspects including design philosophy, chunk properties, finite-state models (FSA and FST), BIO tagging schemes, and statistical sequence labeling approaches such as Conditional Random Fields (CRFs). These questions are particularly useful for students studying NLP, Computational Linguistics, Information Retrieval, and AI, as well as for exam preparation and interview revision.
Try to reason through each question before revealing the answer to strengthen your understanding of how shallow parsing operates in theory and practice.
1.
Which statement best captures the primary design philosophy of shallow parsing?
Correct Answer: C
Shallow parsing trades depth and linguistic completeness for efficiency and robustness.
Shallow parsing (chunking) is designed to identify basic phrases such as noun phrases (NP) and verb phrases (VP), to avoid recursion and nesting, and to keep the analysis fast, simple, and robust.
Because of this design choice, shallow parsing scales well to large corpora, works better with noisy or imperfect POS tagging, and is practical for real-world NLP pipelines (IR, IE, preprocessing).
2.
Why is shallow parsing preferred over deep parsing in large-scale NLP pipelines?
Correct Answer: C
Shallow parsing is preferred over deep parsing because it is computationally faster and more robust to noise while providing sufficient structural information for many NLP tasks.
Shallow parsing is preferred over deep parsing mainly because it is faster, simpler, and more robust, especially in real-world NLP systems. The reasons are as follows:
Computational efficiency: Shallow parsing works with local patterns over POS tags and avoids building full syntactic trees, so it is much faster and uses less memory than deep parsing.
Robustness to noisy data: Shallow parsing tolerates errors because it matches short, local tag sequences.
Scalability: It is suitable for large-scale text processing.
Lower resource requirements: Shallow parsing can be implemented using finite-state automata, regular expressions, and sequence labeling models (e.g., CRFs).
3.
The phrase patterns used in shallow parsing are most appropriately modeled as:
Correct Answer: B
Phrase patterns in shallow parsing are best modeled as regular expressions / regular languages because chunking is local, linear, non-recursive, and non-overlapping. All of these properties fit exactly within the expressive power of regular languages.
Why are the phrase patterns used in shallow parsing modeled as regular expressions / regular languages?
1. Shallow parsing works on POS tag sequences, not full syntax. In chunking, we usually operate on sequences like "DT JJ JJ NN VBZ DT NN" and define patterns such as "NP → DT? JJ* NN+". This is pattern matching over a flat sequence, not hierarchical structure building. That is exactly what regular expressions are designed for.
2. Chunk patterns are non-recursive. Regular languages cannot express recursion. Shallow parsing intentionally avoids recursion (No nested constituents). For example, "[NP the [NP quick brown fox]]" is not allowed in shallow parsing.
3. Chunks are non-overlapping. Each word belongs to at most one chunk. Example: "[NP the dog] [VP chased] [NP the cat]". There is no crossing or embedding like: "*[NP the dog chased] [NP the cat]". This strict linear segmentation matches the finite-state assumption. Since recursion is forbidden by design, CFG power is unnecessary.
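To make the point concrete, here is a minimal sketch (the tag string and pattern are illustrative, not from the original article) showing that an NP chunk pattern such as DT? JJ* NN+ is an ordinary regular expression over the POS-tag sequence, i.e. no more powerful than a finite-state recognizer:

```python
import re

# Recognize the NP pattern DT? JJ* NN+ over a space-separated POS-tag string.
NP_PATTERN = re.compile(r"(?:DT )?(?:JJ )*(?:NN ?)+")

tags = "DT JJ JJ NN VBZ DT NN"
match = NP_PATTERN.match(tags)
print(match.group().strip() if match else "no NP at start")
# Prints: DT JJ JJ NN  -- the leading noun phrase is recognized without any
# recursion or stack, which is all that shallow parsing needs.
```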
4.
Which automaton is suitable for recognizing chunk patterns in rule-based shallow parsing over POS-tagged text?
Correct Answer: B
Why is a deterministic finite-state automaton (FSA) suitable for recognizing chunk patterns in rule-based shallow parsing over POS-tagged text?
Chunk patterns in shallow parsing are regular and flat, so they can be efficiently recognized using a finite state automaton.
In rule-based shallow parsing (chunking), the goal is to recognize flat phrase patterns (such as noun phrases or verb phrases) in a linear sequence of POS tags, for example "DT JJ NN VBZ DT NN".
Chunk patterns are defined using regular expressions like "NP → DT? JJ* NN+".
Such patterns belong to the class of regular languages, which can be recognized by a finite state automaton (FSA). Therefore, a deterministic finite state automaton (FSA) is suitable for recognizing chunk patterns in rule-based shallow parsing. More powerful automata like pushdown automata or Turing machines are unnecessary because shallow parsing does not require recursion or unbounded memory.
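As a practical illustration, a finite-state chunker can be built directly from such a rule. The sketch below uses NLTK's RegexpParser (the nltk package, grammar, and tagged sentence are assumptions for illustration, not part of the original text):

```python
import nltk

grammar = "NP: {<DT>?<JJ>*<NN>+}"   # the same regular pattern, written as a chunk rule
chunker = nltk.RegexpParser(grammar)

tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"),
          ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]

tree = chunker.parse(tagged)
print(tree)
# Roughly: (S (NP The/DT quick/JJ brown/JJ fox/NN) jumps/VBZ over/IN
#             (NP the/DT lazy/JJ dog/NN))
```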
5.
Why are finite-state transducers (FSTs) sometimes preferred over FSAs in shallow parsing?
Correct Answer: B
Finite-state transducers (FSTs) are sometimes preferred over finite-state automata (FSAs) in shallow parsing because they can both recognize patterns and produce output labels, whereas FSAs can only recognize whether a pattern matches.
In shallow parsing, the task is not just to detect that a sequence of POS tags forms a chunk, but also to label the chunk boundaries, such as assigning NP, VP, or BIO tags (B-NP, I-NP, O). An FST maps an input POS-tag sequence to an output sequence with chunk labels or brackets, making it well suited for this purpose.
Since shallow parsing involves flat, non-recursive, and local patterns, the power of finite-state models is sufficient. Using an FST adds practical usefulness by enabling annotation and transformation, while retaining the efficiency and simplicity of finite-state processing.
6.
In the BIO chunk tagging scheme, the tag B-NP indicates:
Correct Answer: B
BIO chunk tagging scheme in shallow parsing - short notes
The BIO chunk tagging scheme is a commonly used method in
shallow parsing (chunking) to label phrase boundaries in a
sequence of tokens.
BIO stands for:
B (Begin) – marks the first word of a chunk
I (Inside) – marks words inside the same chunk
O (Outside) – marks words that are not part of any chunk
Each B and I tag is usually combined with a
chunk type, such as NP (noun phrase) or
VP (verb phrase).
Example:
The quick brown fox jumps
B-NP I-NP I-NP I-NP B-VP
The BIO tagging scheme represents flat, non-overlapping chunks,
avoids hierarchical or nested structures, and converts chunking into a
sequence labeling problem. Due to its simplicity and clarity,
it is widely used in rule-based, statistical, and neural-network-based shallow
parsing systems.
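A minimal sketch of how BIO output is typically converted back into flat chunks (the helper name and its handling of malformed tag sequences are assumptions for illustration):

```python
def bio_to_chunks(tokens, tags):
    """Group (token, BIO-tag) pairs into flat, non-overlapping chunks."""
    chunks, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                  # a new chunk begins
            if current:
                chunks.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)              # continue the open chunk
        else:                                     # "O" (or inconsistent I-) closes any open chunk
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(label, " ".join(words)) for label, words in chunks]

print(bio_to_chunks(
    ["The", "quick", "brown", "fox", "jumps"],
    ["B-NP", "I-NP", "I-NP", "I-NP", "B-VP"]))
# [('NP', 'The quick brown fox'), ('VP', 'jumps')]
```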
7.
Which property must hold for chunks produced by shallow parsing?
8.
When shallow parsing is formulated as a sequence labeling problem, which probabilistic model is commonly used?
Correct Answer: C
What is Conditional Random Field (CRF)?
A CRF (Conditional Random Field) is a probabilistic, discriminative model used for sequence labeling tasks in machine learning and natural language processing.
A Conditional Random Field models the probability of a label sequence given an input sequence, i.e., P(Y | X), where X is the observation sequence and Y is the corresponding label sequence.
What CRFs are used for?
CRFs are commonly used in NLP tasks such as Shallow parsing (chunking), Named Entity Recognition (NER), Part-of-Speech tagging, Information extraction.
Why CRF is used for shallow parsing?
Conditional Random Fields (CRFs) are used for shallow parsing because shallow parsing is naturally a sequence labeling problem, and CRFs are designed to model dependencies between neighboring labels in a sequence.
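A sketch of chunking as CRF sequence labeling with the third-party sklearn-crfsuite package (the feature choices and tiny training set here are illustrative assumptions, not a recipe from the article):

```python
import sklearn_crfsuite

def token_features(sent, i):
    word, pos = sent[i]
    return {
        "word.lower": word.lower(),
        "pos": pos,
        "prev_pos": sent[i - 1][1] if i > 0 else "BOS",
        "next_pos": sent[i + 1][1] if i < len(sent) - 1 else "EOS",
    }

train_sents = [[("The", "DT"), ("dog", "NN"), ("barks", "VBZ")]]
train_labels = [["B-NP", "I-NP", "B-VP"]]

X = [[token_features(s, i) for i in range(len(s))] for s in train_sents]
y = train_labels

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, y)                 # learns P(label sequence | feature sequence)
print(crf.predict(X))         # e.g. [['B-NP', 'I-NP', 'B-VP']]
```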
9.
Shallow parsing is less sensitive to POS tagging errors than deep parsing because:
Correct Answer: C
Shallow parsing is less sensitive to POS tagging errors because it relies on local patterns and partial structure, not a full grammatical tree.
So a small POS mistake usually affects only one chunk, not the whole analysis.
Deep parsing, on the other hand, tries to build a complete syntactic tree, where one wrong POS tag can break the entire parse.
Why is shallow parsing less sensitive to POS tagging errors?
Shallow parsing (chunking) groups words into flat chunks like NP (noun phrase), VP (verb phrase), etc., and relies on local POS patterns.
If one POS tag is wrong, the damage is local: the chunk may still be mostly correct, and the neighboring chunks remain unaffected. The error does not propagate far.
Why is deep parsing more sensitive to POS tagging errors?
Deep parsing (full syntactic parsing) builds a hierarchical parse tree with dependencies between words. POS tags determine the phrase boundaries, head–dependent relations, and overall sentence structure.
If a POS tag is wrong, the parser may choose the wrong grammar rule, fail to build a valid tree, or produce a completely incorrect parse; the error propagates through the entire tree.
Example:
In the sentence "The can rusts quickly", if the word "can" is wrongly tagged as a VERB instead of a NOUN,
Shallow parsing: Might still form a rough NP or VP and the error affects only one chunk.
Deep parsing: Subject–verb structure breaks and the whole sentence tree becomes invalid or wrong.
10.
Which of the following tasks lies just beyond the scope of shallow parsing?
Correct Answer: C
Shallow parsing cannot resolve subject–object dependencies. Resolving them requires knowing which phrase is the subject and which is the object, i.e., syntactic relations that span across phrases.
In simpler terms, shallow parsing identifies flat phrase boundaries such as NP, VP, and PP, but does not determine grammatical relations like subject–object dependencies, which require deep syntactic analysis.
1.
A typical router in the core of the Internet has: [University of Calgary, CPSC 441: Computer Networks, December 2018 - Final exam answers]
Correct Answer: E
A typical core Internet router must have:
Multiple interfaces with IP addresses
A switching fabric
Buffers for packet queuing
Scheduling algorithms for packet transmission
Why are all the options correct?
(a) Multiple network interfaces, each with its own IP address
A core router connects to many different networks or links (often high-speed fiber links).
Each network interface (port) operates as a separate network endpoint and therefore must
have its own IP address. This allows the router to correctly send, receive, and route
packets on each connected link.
Hence, this statement is true.
(b) A switching fabric between its input ports and output ports
Inside a router, packets arriving on input ports must be forwarded to the correct output
ports. This internal data path is called the switching fabric.
It may be implemented using memory, a shared bus, or a crossbar switch.
The switching fabric enables high-speed packet transfer inside the router.
This is an essential internal component, so the statement is true.
(c) Buffers to hold a queue of packets at input or output ports
Packets can arrive faster than they can be transmitted, especially during periods of
congestion. To handle this, routers maintain buffers (queues) at:
Input ports (to wait for access to the switching fabric)
Output ports (to wait for link transmission)
Without buffers, packets would be dropped immediately during traffic bursts.
Therefore, this statement is true.
(d) A scheduling algorithm to determine which queued packet goes next
When multiple packets are waiting in a queue, the router needs a rule to decide which
packet is sent next. This is handled using scheduling algorithms.
Common scheduling algorithms include:
FIFO (First-In-First-Out)
Priority Scheduling
Round Robin
Weighted Fair Queuing (WFQ)
These algorithms help manage fairness, delay, and Quality of Service (QoS).
Hence, this statement is true.
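To illustrate the kind of decision rule a scheduler implements, here is a toy sketch (illustrative only, not router firmware) of round-robin scheduling over per-flow packet queues:

```python
from collections import deque

queues = {
    "flow-A": deque(["A1", "A2", "A3"]),
    "flow-B": deque(["B1"]),
    "flow-C": deque(["C1", "C2"]),
}

def round_robin(queues):
    """Yield packets by visiting each non-empty queue in turn."""
    while any(queues.values()):
        for name, q in queues.items():
            if q:
                yield name, q.popleft()

for flow, packet in round_robin(queues):
    print(flow, packet)
# Transmission order: A1, B1, C1, A2, C2, A3 -- each backlogged flow gets a turn.
```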
2.
Consider the circuit-switched network shown in the figure below, with circuit switches A, B, C, and D. Suppose there are 14 circuits between A and B, 10 circuits between B and C, 13 circuits between C and D, and 11 circuits between D and A. What is the maximum number of connections that can be ongoing in the network at any one time? [University of Massachusetts Amherst, Computer Networking]
Correct Answer: D
The maximum number of connections that can be ongoing in the network at any one time is 48
Given Network
The circuit-switched network consists of four switches A, B, C, and D
connected in a square topology.
The number of circuits on each link is:
A ↔ B has 14 circuits, B ↔ C has 10 circuits,
C ↔ D has 13 circuits, and D ↔ A has 11 circuits.
Each circuit can support only one connection at a time.
Key Idea
In a circuit-switched network, each connection occupies one circuit on every link it uses.
To maximize the total number of simultaneous connections, the network should favor
single-link connections instead of multi-hop paths.
This ensures that circuits are not wasted on multiple links for the same connection.
Maximum Number of Ongoing Connections
Since all links can be used independently, the number of ongoing connections equals
the total number of available circuits: 14 + 10 + 13 + 11 = 48.
3.
Which of the following is the primary cause of packet jitter in a packet-switched network? [University of Michigan, EECS489 Computer Networks, Winter 2007 - Final exam answers]
Correct Answer: B
Explanation:
Packet jitter refers to the variation in packet delay. The main cause is variable queuing delay at routers and switches, which occurs when network congestion changes over time. Different packets experience different waiting times, leading to jitter.
Other causes of packet jitter can be due to cross traffic, bursty behavior of the end-user or end-host, etc.
4.
“Layering” is commonly used in computer networks because: [Stanford University, CS244a An Introduction to Computer Networks, March 2007 - Final exam answers]
Correct Answer: C
Layering is used in computer networks because it simplifies design, promotes interoperability, and enables reuse of protocol implementations.
Explanation: Why layering is used?
Layering divides the complex task of network communication into separate, well-defined layers, where each layer has a specific responsibility (e.g., physical transmission, routing, transport reliability). This design brings major software engineering benefits such as:
Each layer can be developed independently
Implementations can be reused across different systems
Changes in one layer do not break other layers
Standard interfaces promote interoperability
5.
In an Ethernet network, which of the following are true: [Stanford University, CS244a An Introduction to Computer Networks, March 2007 - Final exam answers]
Correct Answer: D
Ethernet switches learn MAC addresses by observing the source address of incoming frames, while hubs do no learning at all.
Explanation:
In Ethernet networks, switches (also called bridges) maintain a MAC address table (forwarding table).
This table maps: MAC address to switch port.
The switch learns this mapping by observing the source MAC address of incoming frames and recording which port they arrived on.
Why the Other Options Are Incorrect
(a) Ethernet switches do not learn addresses from the destination
MAC address. The destination address is used only to decide where a frame should
be forwarded. Learning requires knowing where a device is located, which can be
determined reliably only from the source MAC address.
(b) Ethernet hubs and repeaters operate at the physical layer and do not
inspect frame headers. They do not maintain MAC address tables and simply forward
incoming signals to all ports. Therefore, they are incapable of learning addresses.
(c) A correctly functioning Ethernet switch may send a frame to multiple
ports when the destination address is unknown or when the frame is a broadcast or
multicast. This behavior is normal and does not indicate incorrect operation.
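A toy sketch of the self-learning behaviour described above (illustrative Python, with shortened MAC strings; not actual switch firmware): learn the source MAC on arrival, forward by destination, flood when the destination is unknown.

```python
mac_table = {}          # MAC address -> port

def handle_frame(src_mac, dst_mac, in_port, all_ports):
    mac_table[src_mac] = in_port                   # learn from the SOURCE address
    if dst_mac in mac_table:
        return [mac_table[dst_mac]]                # known destination: forward to one port
    return [p for p in all_ports if p != in_port]  # unknown/broadcast: flood the other ports

ports = [1, 2, 3, 4]
print(handle_frame("aa:aa", "bb:bb", 1, ports))    # [2, 3, 4]  (bb:bb unknown -> flood)
print(handle_frame("bb:bb", "aa:aa", 2, ports))    # [1]        (aa:aa was learned on port 1)
```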
Welcome to a comprehensive Machine Learning MCQ practice hub designed for students, job aspirants, competitive exam takers, and interview candidates looking to strengthen their conceptual understanding and problem-solving skills in core ML topics. This curated set of Multiple Choice Questions (MCQs) focuses on foundational algorithms and models including Perceptron, Support Vector Machines (SVM), Clustering techniques, and Neural Networks — all of which form the backbone of modern Artificial Intelligence and Data Science.
Machine learning empowers computers to learn patterns and make predictions from data without explicit programming — a capability at the heart of applications like image recognition, natural language processing, and intelligent automation. Practicing MCQs helps you reinforce key ideas such as linear and non-linear classification boundaries, maximum margin optimization in SVMs, unsupervised grouping in clustering, and layered function approximation in neural networks — sharpening your exam readiness and coding intuition.
Whether you are preparing for semester exams, GATE, university viva-voce tests, technical interviews, or certification quizzes, these machine learning MCQs with detailed answers will guide your conceptual clarity, analytical thinking, and practical exam performance — making complex algorithms approachable and memorable.
1.
Consider a Perceptron that has two input units and one output unit, which uses
an LTU activation function, plus a bias input of +1 and a bias weight w3 = 1. If both
inputs associated with an example are 0 and both weights, w1 and w2, connecting the
input units to the output unit have value 1, and the desired (teacher) output value is 0,
how will the weights change after applying the Perceptron Learning rule with learning
rate parameter α = 1? [University of Wisconsin–Madison, CS540-2: Introduction to Artificial Intelligence, May 2018 - Final exam answers]
A Perceptron is the simplest type of artificial neural network and is used
for binary classification problems. It works like a decision-making unit
that takes multiple inputs, multiplies each input by a weight, adds a bias, and then
produces an output.
Mathematically, the perceptron computes a weighted sum of inputs and passes it through
an LTU (threshold) activation function: the output is 1 when the weighted sum w1·x1 + w2·x2 + w3·(+1) exceeds the threshold, and 0 otherwise.
Perceptron Weight Update Using the Perceptron Learning Rule - Answer explained
After applying the Perceptron Learning Rule, the updated weights are:
w1 = 1
w2 = 1
w3 = 0
Explanation: The net input is w1·0 + w2·0 + w3·(+1) = 1, so the perceptron outputs 1 while the desired (teacher) output is 0. The update rule wi ← wi + α(t − o)·xi leaves w1 and w2 unchanged because their inputs are 0, and changes the bias weight by 1·(0 − 1)·1 = −1, reducing w3 from 1 to 0 so that the net input is lower in future predictions.
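A short sketch of the same update in code (an LTU threshold at 0 is assumed; the exam's exact convention is not shown in the text):

```python
def ltu(net):
    return 1 if net > 0 else 0

x = [0, 0, 1]          # two inputs plus the constant bias input of +1
w = [1, 1, 1]          # w1, w2, w3 (bias weight)
target, alpha = 0, 1

net = sum(wi * xi for wi, xi in zip(w, x))     # 0*1 + 0*1 + 1*1 = 1
out = ltu(net)                                 # 1, but the teacher output is 0
w = [wi + alpha * (target - out) * xi for wi, xi in zip(w, x)]
print(w)                                       # [1, 1, 0]: only the bias weight changes
```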
2.
Consider a dataset containing six one-dimensional points: {2, 4, 7, 8, 12, 14}. After three iterations of Hierarchical Agglomerative Clustering using Euclidean distance between points, we get the 3 clusters: C1 = {2, 4}, C2 = {7, 8} and C3 = {12, 14}. What clusters are merged at the next iteration using Single Linkage? [University of Wisconsin–Madison, CS540: Introduction to Artificial Intelligence, October 2019 - Midterm exam answers]
Correct Answer: A
Merge Using Single Linkage in Hierarchical Clustering
In Single Linkage hierarchical clustering, the distance between two
clusters is defined as the minimum distance between any pair of points,
one from each cluster.
The single-linkage distances between the current clusters are d(C1, C2) = 7 − 4 = 3, d(C2, C3) = 12 − 8 = 4, and d(C1, C3) = 12 − 4 = 8.
The smallest inter-cluster distance is d(C1, C2) = 3.
Therefore, using Single Linkage, the clusters C1 and C2 are merged in the next iteration.
Resulting cluster: {2, 4, 7, 8}
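A quick check of those single-linkage distances in plain Python (cluster contents taken from the question; the helper name is illustrative):

```python
def single_linkage(c1, c2):
    """Minimum distance between any pair of points, one from each cluster."""
    return min(abs(a - b) for a in c1 for b in c2)

C1, C2, C3 = [2, 4], [7, 8], [12, 14]
print(single_linkage(C1, C2))   # 3  -> smallest, so C1 and C2 merge next
print(single_linkage(C2, C3))   # 4
print(single_linkage(C1, C3))   # 8
```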
3.
Which of the following are true of support vector machines? [University of California at Berkeley, CS189: Introduction to Machine Learning, Spring 2019 - Final exam answers]
Correct Answer: A
What does the hyperparameter C mean in SVM?
In a soft-margin Support Vector Machine, the hyperparameter C controls the trade-off between:
Maximizing the margin (simpler model)
Minimizing classification error on training data
Explanation of each option
Option A — TRUE
Increasing the hyperparameter C penalizes misclassified training points
more heavily, forcing the SVM to fit the training data more accurately.
➜ Training error generally decreases.
Option B — FALSE
Hard-margin SVM allows no misclassification and corresponds to
C → ∞, not C = 0.
➜ With C = 0, misclassification is not penalized.
Option C — FALSE
Increasing C makes the classifier fit the training data more strictly.
➜ Training error decreases, not increases.
Option D — FALSE
A large C forces the decision boundary to accommodate even outliers.
➜ Sensitivity to outliers increases, not decreases.
Final Answer: Only Option A is true.
Exam Tip: Think of C as the cost of misclassification.
High C → low training error but high sensitivity to outliers.
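A small sketch (toy data and illustrative C values, not from the exam) of how the soft-margin penalty C is set with scikit-learn's SVC; larger C penalizes training mistakes more heavily:

```python
from sklearn.svm import SVC

X = [[0, 0], [1, 1], [1, 0], [0, 1], [2, 2], [3, 3]]
y = [0, 0, 0, 1, 1, 1]

loose = SVC(kernel="rbf", C=0.1).fit(X, y)     # wide margin, tolerates training errors
strict = SVC(kernel="rbf", C=100.0).fit(X, y)  # fits the training points more tightly
print(loose.score(X, y), strict.score(X, y))   # training accuracy typically rises with C
```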
4.
Which of the following might be valid reasons for preferring an SVM over a neural network? [Indian Institute of Technology Delhi, ELL784: Introduction to Machine Learning, 2017 - 18 - Exam answers]
Correct Answer: B
Kernel SVMs can implicitly operate in infinite-dimensional feature spaces via the kernel trick, while neural networks have finite-dimensional parameterizations.
Option (b): An SVM can effectively map the data to an infinite-dimensional space; a neural net cannot.
The key idea here comes from the kernel trick. Kernel-based SVMs
(such as those using the RBF kernel) implicitly operate in an
infinite-dimensional Hilbert space.
This mapping is done implicitly, without explicitly computing features.
The number of learned parameters does not grow with the feature space.
The optimization problem remains convex, guaranteeing a global optimum.
In contrast, neural networks:
Operate in finite-dimensional parameter spaces (finite neurons and weights).
Do not truly optimize over an infinite-dimensional feature space.
Require explicit architectural growth to approximate higher complexity.
SVMs can exactly work in infinite-dimensional feature spaces via kernels,
whereas neural networks can only approximate such mappings using finite architectures.
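For reference, the RBF kernel mentioned above is K(x, x′) = exp(−γ ‖x − x′‖²); its implicit feature map is infinite-dimensional, yet only pairwise kernel values between data points are ever computed, which is why the number of learned parameters does not grow with the feature space.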
Why other options are INCORRECT?
Option (a) — Incorrect:
Neural networks can also learn non-linear transformations through hidden layers and activation functions.
Option (c) — Incorrect:
Unlike neural networks, SVMs solve a convex optimization problem and do not get stuck in local minima.
Option (d) — Incorrect:
The implicit feature space created by SVM kernels is typically harder—not easier—to interpret than neural network representations.
5.
Suppose that you are training a neural network for classification, but you notice
that the training loss is much lower than the validation loss. Which of the following is the most appropriate
way to address this issue? [Stanford University, CS224N: Natural Language Processing with Deep Learning
Winter 2018 - Midterm exam answers]
Correct Answer: C
What does "training loss is much lower than the validation loss" mean?
A large gap between training and validation loss is a strong indicator of overfitting, where the model has low bias but high variance.
When the training loss is much lower than the validation loss, it means:
The model is learning the training data too well, including noise and minor patterns.
It fails to generalize to unseen data (validation set).
In other words, the network performs well on seen data but poorly on new data.
Why this happens
The model is too complex (too many layers or neurons).
Limited training data to learn generalized patterns.
Training for too many epochs, allowing memorization of the training set.
Explanation: Why option C is correct?
A much lower training loss compared to validation loss indicates
overfitting. Increasing the L2 regularization weight
penalizes large model weights, discourages overly complex decision boundaries,
and improves generalization to unseen data.
Why the other options are incorrect
Option A — Incorrect:
Decreasing dropout reduces regularization and typically worsens overfitting.
Option B — Incorrect:
Increasing hidden layer size increases model capacity, making overfitting more likely.
Option D — Incorrect:
Adding more layers increases complexity and usually amplifies overfitting.
Note: When training loss ≪ validation loss, think
regularization, simpler models, or more data.
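A sketch (hypothetical model and values) of how an L2 penalty is typically increased when training a network with PyTorch: the optimizer's weight_decay argument adds an L2 term on the weights.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 10))

# Larger weight_decay = stronger L2 regularization = simpler effective model.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-3)

x, y = torch.randn(32, 100), torch.randint(0, 10, (32,))
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()
optimizer.step()      # the gradient step now also shrinks weights toward zero
```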
6.
Traditionally, when we have a real-valued input attribute during decision-tree learning
we consider a binary split according to whether the attribute is above or below some
threshold. Pat suggests that instead we should just have a multiway split with one
branch for each of the distinct values of the attribute. From the list below choose the
single biggest problem with Pat’s suggestion: [Carnegie Mellon University, 10-701/15-781 Final, Fall 2003 - Final exam answers]
Correct Answer: C
How Decision Trees Handle Real-Valued Attributes
Traditional Approach (Binary Split)
For a real-valued attribute A, decision trees choose a threshold
t and split the data as:
A ≤ t
A > t
This approach groups nearby values together, allowing the model to learn general
patterns while keeping the decision tree simple and robust.
Pat’s Suggestion
Pat proposes using a multiway split, with one branch for each
distinct value of the real-valued attribute.
If the attribute has many unique values (which is very common for real-valued data),
this would create many branches—potentially one branch per training example.
What Goes Wrong?
1. Perfect Memorization of Training Data
Each training example can end up in its own branch
Leaf nodes become extremely “pure”
The decision tree effectively memorizes the training set
→ This usually results in very high (sometimes perfect) training accuracy.
2. Very Poor Generalization
Test data often contains values not seen during training
Even very close numeric values are treated as completely different
The model cannot generalize across ranges of values
→ This leads to poor performance on the test set.
Why Option (iii) Is the Biggest Problem
Option (i) Too computationally expensive ❌
Multiway splits increase complexity, but learning is still feasible and this is not the main issue.
Option (ii) Bad on both training and test ❌
Incorrect, because training performance is usually very good.
Option (iii) Good on training, bad on test ✅ (Correct)
This is a classic case of overfitting, where the model learns noise and exact values instead of true patterns.
Option (iv) Good on test, bad on training ❌
Highly unlikely for a decision tree with this much flexibility.
Final Conclusion
Pat’s approach causes severe overfitting:
Excellent training accuracy
Poor generalization to unseen data
Therefore, the correct answer is:
(iii) It would probably result in a decision tree that scores well on the training set
but badly on a test set.
7.
For a neural network, which one of these structural assumptions is the one that most
affects the trade-off between underfitting (i.e. a high bias model) and overfitting (i.e.
a high variance model): [Carnegie Mellon University, 10-701/15-781 Final, Fall 2003 - Final exam answers]
Correct Answer: A
The question is about model capacity / complexity, which directly controls
the bias–variance trade-off.
Key Concept: Bias–Variance Trade-off
High Bias (Underfitting)
Model is too simple
Cannot capture underlying patterns
High Variance (Overfitting)
Model is too complex
Fits noise in the training data
The main factor controlling this trade-off is how expressive the model is.
Evaluating Each Option
Option (i) The Number of Hidden Nodes (Correct)
It determines how many parameters the network has
It controls how complex a function the network can represent
Few hidden nodes → simple model → high bias (underfitting)
Many hidden nodes → complex model → high variance (overfitting)
Overall, this directly controls the bias–variance trade-off.
Option (ii) The Learning Rate ❌
Affects training speed and convergence stability but does not change the model’s capacity
or bias–variance behavior.
Option (iii) The Initial Choice of Weights ❌
Influences which local minimum is reached, but not the network structure or overall model
complexity.
Option (iv) The Use of a Constant-Term Unit Input (Bias Unit) ❌
Allows shifting of activation functions, but has only a minor effect compared to the number
of hidden nodes.
8.
Select the true statements about k-means clustering. Assume no two sample points are equal. [University of California at Berkeley, CS189: Introduction to Machine Learning, Spring 2025 - Final exam answers]
Correct Answer: D
The question asks you to select the true statements about k-means clustering, specifically about Lloyd’s Algorithm, which is the standard algorithm used to solve k-means.
k-means is greedy, initialization-dependent, centroid-based, and increasing k never increases the optimal cost.
Answer explanation:
Key assumptions given:
No two sample points are equal (this avoids tie cases but does not change the main conclusions).
We are reasoning about the k-means objective (cost) function, usually the sum of squared distances from points to their assigned cluster centroids.
Correct Answer: D
Increasing the number of clusters k can never increase the
global minimum of the k-means cost function.
Why this is true:
When k increases, the algorithm has more freedom
to place centroids closer to the data points.
The optimal cost for k + 1 clusters is
never worse than for k clusters.
In the worst case, we could reuse the same clustering as before.
Since the number of data points is greater than the number of clusters,
each cluster can contain at least one point, so the objective function
remains valid.
Formally:
J*(k + 1) ≤ J*(k)
In the extreme case:
k = n (one cluster per point) → each point is its own centroid →
cost = 0.
Note: This statement is about the global minimum,
not the solution found by Lloyd’s algorithm in practice, which may get stuck
in a local minimum.
Why other options are wrong?
Option A: Lloyd’s finds local minima, not global ❌
Option B: Average linkage ≠ k-means ❌
Option C: Initialization affects results ❌
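A quick sketch (random data, illustrative k values) showing that the k-means cost (inertia) found by Lloyd's algorithm typically falls as k grows, and that the cost reaches zero when every point gets its own cluster; the global optimum can never get worse as k increases.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

for k in (1, 2, 4, 8, 30):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 3))
# inertia decreases with k and is 0.0 when k equals the number of points
```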
9.
For very large training data sets, which of the following will usually have the lowest training time? [University of Pennsylvania, CIS 520: Machine Learning, 2019 - Final exam answers]
Correct Answer: C
“KNN has almost zero training cost because it does not learn a model; it only stores the data.”
Option-by-option explanation
❌ Logistic Regression: Training involves iterative optimization (gradient descent, Newton methods). Cost per iteration is O(n·d). Needs many passes over the data. Not the fastest for very large datasets.
❌ Neural Networks: Training requires multiple epochs and backpropagation. Computationally very expensive. Training time increases rapidly with: Data size, Number of layers. and Number of neurons. One of the slowest to train.
✅ K-Nearest Neighbors (KNN): KNN has essentially no training phase. Training consists of simply storing the dataset in memory. No optimization, no model fitting. Training time is approximately O(1) (or linear time to store data). Lowest training time, especially for very large datasets.
⚠️ Note: Prediction time is expensive, but that is not asked here
❌ Random Forests: Training involves building many decision trees. Each tree performs recursive splits. Training cost grows quickly with number of trees, depth of trees, and dataset size. Slow to train on very large datasets.
10.
In building a linear regression model for a particular data set, you observe the coefficient of one of the features having a relatively high negative value. This suggests that [Indian Institute of Technology Madras (IITM), Introduction to Machine Learning, Quiz answers]
Correct Answer: C
In linear regression, coefficient magnitude alone does not determine feature importance unless features are on comparable scales.
A high magnitude suggests that the feature is important. However, it may be the case that another feature is highly correlated with this feature and its coefficient also has a high magnitude with the opposite sign, in effect cancelling out the effect of the former. Thus, we cannot really remark on the importance of a feature just because its coefficient has a relatively large magnitude.
Why other options are wrong?
Option A: The magnitude of a coefficient alone is misleading. Without knowing feature scaling, units, correlation with other features, regularization used etc., you cannot conclude that the feature has a “strong effect”. ❌
Option B: A high-magnitude coefficient (even negative) indicates that the model is sensitive to that feature. Ignoring the feature based only on the sign or raw magnitude is unjustified. ❌
Frequently Asked Questions (Machine Learning)
What does it mean when training loss is much lower than validation loss?
When training loss is much lower than validation loss, it indicates that the model is overfitting.
The model has learned the training data very well, including noise, but fails to generalize to unseen data.
This usually occurs due to high model complexity or insufficient regularization.
Why does using a multiway split for real-valued attributes in decision trees cause problems?
Using a multiway split for real-valued attributes creates many branches, often one per unique value.
This leads to overfitting, where the decision tree performs very well on training data but poorly on test data
because the splits capture noise rather than general patterns.
Which machine learning algorithm has the lowest training time for very large datasets?
K-nearest neighbors (KNN) usually has the lowest training time because it does not learn an explicit model.
Training simply involves storing the data, while most computation happens during prediction.
Does Lloyd’s algorithm for k-means clustering find the global minimum?
No, Lloyd’s algorithm does not guarantee finding the global minimum of the k-means objective function.
It converges to a local minimum that depends on the initial choice of cluster centroids.
Does increasing the number of clusters (k) in k-means always reduce the cost function?
Increasing the number of clusters cannot increase the global minimum of the k-means cost function
as long as the number of data points is greater than the number of clusters.
The cost is non-increasing as k increases because additional clusters allow equal or better fitting of the data.
How should a large negative coefficient be interpreted in linear regression?
A large negative coefficient indicates that the feature is negatively correlated with the target variable.
However, the magnitude alone does not determine feature importance unless features are on comparable scales.
Additional information such as feature normalization is required.
How does increasing the number of hidden nodes affect bias and variance?
Increasing the number of hidden nodes generally reduces bias but increases variance.
While a more complex model can better fit the training data, it also becomes more prone to overfitting.
Why is feature scaling important when interpreting linear model coefficients?
Feature scaling makes coefficients comparable across features in linear models.
Without scaling, features measured in smaller units may appear more important due to larger coefficient values,
even if their true effect on the target variable is small.