
Friday, January 30, 2026

10 University-Level MCQs on Linear Regression, Ridge, Lasso & Multicollinearity (With Answers)


Introduction

Linear regression is one of the most fundamental models in statistics and machine learning, forming the basis for many advanced methods used in data science and artificial intelligence. Despite its simplicity, a deep understanding of linear regression involves important theoretical concepts such as ordinary least squares (OLS), regularization techniques, multicollinearity, optimization methods, and statistical hypothesis testing.

In this article, we present a curated set of advanced multiple-choice questions (MCQs) on linear regression and its variants, inspired by exam-style questions from top universities such as UC Berkeley, Stanford, MIT, CMU, and ETH Zürich. These questions are designed to test not only computational knowledge, but also conceptual clarity and theoretical reasoning.

Each question is accompanied by a clear explanation of the correct answer, making this resource especially useful for machine learning students, data science learners, and candidates preparing for university exams, interviews, or competitive tests. Topics covered include ridge and lasso regression, feature duplication, multicollinearity, the F-test, optimization methods, and the relationship between linear regression and neural networks.

Whether you are revising core concepts or aiming to strengthen your theoretical foundations, these MCQs will help you identify common misconceptions and develop a deeper understanding of linear regression models.

1.
Ridge regression can be used to address the problem of: [ETH, Zurich, Advanced Machine Learning, January 2021 - Final exam answers]






Correct Answer: B

Ridge regression introduces an L2 regularization term that penalizes large coefficients.

This helps reduce model variance and stabilizes coefficient estimates when predictors are highly correlated (multicollinearity).

As a result, ridge regression effectively controls overfitting in linear regression models without eliminating features.

2.
Compared with LASSO, ridge regression is more likely to end up with higher sparsity of coefficients. This statement is: [ETH, Zurich, Advanced Machine Learning, January 2021 - Final exam answers]






Correct Answer: C

LASSO (L1 regularization) can force some coefficients to become exactly zero, resulting in higher sparsity and implicit feature selection.

In contrast, ridge regression (L2 regularization) only shrinks coefficients toward zero but rarely makes them exactly zero.

Therefore, ridge regression typically produces less sparse solutions compared to LASSO.
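
To see the difference concretely, here is a minimal Python sketch (not part of the original exam; the synthetic data and alpha values are illustrative assumptions) comparing how many coefficients Ridge and Lasso drive to exactly zero:

```python
# Illustrative sketch: compare coefficient sparsity of Ridge (L2) and
# Lasso (L1) on synthetic data with correlated features.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)      # two highly correlated columns
true_w = np.array([3.0, 0.0, -2.0] + [0.0] * (d - 3))
y = X @ true_w + 0.1 * rng.normal(size=n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge near-zero coefs:", np.sum(np.abs(ridge.coef_) < 1e-6))   # usually 0
print("lasso exactly-zero coefs:", np.sum(lasso.coef_ == 0.0))        # usually several
```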

3.
Which of the following is an advantage of Linear Regression over k-Nearest Neighbors (k-NN)? [University of Toronto, Machine Learning and Data Mining, Fall 2018 - Midterm exam answers]






Correct Answer: B

Linear Regression learns a global linear function, making it faster, interpretable, and better at generalizing for linear patterns.

In contrast, k-NN is non-parametric, requires storing the full dataset, and can be slow and less interpretable.

4.
Which of the following statements about linear regression and regularization is TRUE? [University of California, CS 189/289A Introduction to Machine Learning, Spring 2025 - Midterm exam answers]






Correct Answer: A

Ridge regression has a quadratic, convex cost function, so Newton’s Method reaches the global minimum in a single step.

Ridge regression has the following cost function:

J(w) = ∥Xw − y∥² + λ∥w∥₂²

This cost function is a quadratic and convex function in w.

For a quadratic function:

  • Newton’s method reaches the global minimum in one step, regardless of the starting point.

What is Newton's method in this context?

Newton’s method is an optimization algorithm used to find the minimum of a cost function. In linear / ridge regression, we use it to find the best weights w that minimize the loss.

Newton’s method says: “Use both the slope (gradient) and the curvature (second derivative) of the cost function to jump directly toward the minimum.”

In simple terms: gradient descent walks downhill step by step, while Newton’s method jumps straight to the bottom when the surface is a simple quadratic bowl.

Why other options are wrong?

  • Option B: Validation error may increase. Adding quadratic features increases model complexity and can only lower (or keep equal) the training error, but the richer model may overfit, so validation cost can increase. The word "always" makes this option wrong.
  • Option C: Newton’s method requires second derivatives, and the Lasso objective is non-differentiable (the L1 term has no derivative at zero), so Newton’s method cannot be applied directly, and certainly not in one step.
  • Option D: Lasso uses an L1 penalty, not L2.

Note: In practice, ridge regression is usually solved analytically rather than iteratively, but Newton’s method converges in one step due to the quadratic objective.
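
As a quick illustration of the one-step property, the following NumPy sketch (illustrative data; λ = 1 is an arbitrary choice) takes a single Newton step on the ridge objective and checks that it matches the closed-form ridge solution:

```python
# Sketch: one Newton step on J(w) = ||Xw - y||^2 + lam*||w||^2 reaches the
# closed-form minimizer, because the objective is quadratic.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
lam = 1.0

w0 = np.zeros(3)                                  # arbitrary starting point
grad = 2 * X.T @ (X @ w0 - y) + 2 * lam * w0      # gradient at w0
H = 2 * X.T @ X + 2 * lam * np.eye(3)             # Hessian (constant for a quadratic)
w_newton = w0 - np.linalg.solve(H, grad)          # single Newton step

w_closed = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)  # ridge solution
print(np.allclose(w_newton, w_closed))            # True
```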

5.
Which of the following statements best explains why linear regression can be considered a special case of a neural network? [Carnegie Mellon University, 10-701 Machine Learning, Fall 2018 - Midterm exam answers - Modified Question]






Correct Answer: D

Linear regression can be viewed as a neural network with a single output neuron, no hidden layers, and a linear (identity) activation function, making it a special case of neural networks.

Mathematical Equivalence of Linear Regression and Neural Network

Linear Regression:

y = wᵀx + b

Single-Neuron Neural Network (No Activation):

y = wᵀx + b

They are identical: the same equation, the same parameters, and the same predictions.

Hence, linear regression is a special (degenerate) case of a neural network.

6.
When selecting additional input points x to label for a linear regression model, why is it desirable to choose points that are as spread out as possible? [University of Pennsylvania, CIS520 Machine Learning, 2019 – Final Exam (Modified)]






Correct Answer: A

Points that are far from existing data have higher leverage and provide new geometric information about the regression line. Labeling such points improves the conditioning of XᵀX, thereby reducing variance in parameter estimates.

In simple terms, nearby points mostly repeat existing information, while distant points teach the model something new.

From a statistical perspective, the covariance of the least-squares estimator is σ²(XᵀX)⁻¹; spread-out points improve the conditioning of XᵀX, thereby reducing parameter variance.
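
A small NumPy sketch (with made-up x values, purely for illustration) shows how spread-out inputs improve the conditioning of XᵀX and shrink the slope variance:

```python
# Sketch: spread-out x values improve the conditioning of X^T X and shrink
# the parameter variance sigma^2 (X^T X)^{-1} (shown here for slope + intercept).
import numpy as np

def slope_variance(x, sigma2=1.0):
    X = np.column_stack([np.ones_like(x), x])     # design matrix with intercept
    cov = sigma2 * np.linalg.inv(X.T @ X)         # covariance of OLS estimates
    return cov[1, 1]                              # variance of the slope estimate

x_clustered = np.array([4.9, 5.0, 5.0, 5.1, 5.2])
x_spread    = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

print(slope_variance(x_clustered))   # large variance
print(slope_variance(x_spread))      # much smaller variance
```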

7.
Consider a linear regression model. Suppose one of the input features is duplicated (i.e., an identical copy of a feature is added to the design matrix). Which of the following statements are correct? [University of California at Berkeley, CS189 Introduction to Machine Learning, 2015 – Final Exam (Modified)]






Correct Answer: B

What does “duplicating a feature” mean?

Suppose your original linear regression model is:

y = w₁x₁ + ε

Now, suppose you duplicate the feature x1, meaning the same feature appears twice as identical columns in the design matrix. The model then becomes:

y = w₁x₁ + w₂x₁ + ε

This can be rewritten as:

y = (w₁ + w₂)x₁ + ε

Thus, the prediction capacity of the model remains unchanged. Duplicating a feature does not add new information; it merely allows the same effect to be distributed across multiple weights.

Explanation

Option A (Incorrect):
In ridge regression, the penalty term is λ ∑ wᵢ². When a feature is duplicated, its weight can be split across two identical features. This reduces the L2 penalty while keeping the model’s predictions unchanged.

Option B (Correct):
The Residual Sum of Squares (RSS) depends only on the model’s predictions. Duplicating a feature does not introduce any new information, so the minimum RSS remains unchanged.

Option C (Incorrect):
In lasso regression, the penalty term is λ ∑ |wᵢ|. Splitting a weight across duplicated features does not reduce the L1 penalty, since lasso encourages sparsity and prefers assigning the entire weight to a single feature.
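
The following NumPy sketch (synthetic data, purely illustrative) verifies that duplicating a column leaves the minimum RSS unchanged, while splitting the weight lowers the L2 penalty but not the L1 penalty:

```python
# Sketch: duplicating a feature column leaves the minimum RSS unchanged;
# splitting the weight lowers the L2 penalty but not the L1 penalty.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
y = 2.0 * x1 + 0.1 * rng.normal(size=100)

X_orig = x1.reshape(-1, 1)
X_dup = np.column_stack([x1, x1])                 # identical duplicated column

w_orig = np.linalg.lstsq(X_orig, y, rcond=None)[0]
w_dup = np.linalg.lstsq(X_dup, y, rcond=None)[0]  # minimum-norm solution splits the weight

rss = lambda X, w: np.sum((y - X @ w) ** 2)
print(np.isclose(rss(X_orig, w_orig), rss(X_dup, w_dup)))   # True: same minimum RSS

w = w_orig[0]
print(w ** 2, (w / 2) ** 2 + (w / 2) ** 2)   # L2 penalty halves when the weight is split
print(abs(w), abs(w / 2) + abs(w / 2))       # L1 penalty is unchanged
```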

8.
In a multiple linear regression model y = β₀ + β₁x₁ + β₂x₂ + ··· + βₖxₖ + ε, which of the following statements is TRUE? [Stanford University, CS229 Machine Learning – Exam Style]






Correct Answer: B

What is Ordinary Least Squares (OLS)?

Ordinary Least Squares (OLS) is a method used in linear regression to estimate the model parameters (coefficients). It chooses the coefficients so that the sum of squared differences between the observed values and the predicted values is as small as possible.

Why Option B is correct

OLS estimates the regression coefficients by minimizing the sum of squared residuals, i.e.,

RSS = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

This is the defining property of OLS. The solution is obtained by solving a convex optimization problem, which guarantees that the coefficients chosen minimize the residual sum of squares.
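
As a brief illustration (synthetic data, not from the exam), the normal equations give the coefficients that attain this minimum RSS:

```python
# Sketch: OLS coefficients minimize the residual sum of squares; the closed-form
# solution comes from the normal equations (X^T X) beta = X^T y.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 predictors
beta_true = np.array([1.0, 2.0, -3.0])
y = X @ beta_true + 0.5 * rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)      # normal equations
rss = np.sum((y - X @ beta_hat) ** 2)
print(beta_hat)   # close to [1, 2, -3]
print(rss)        # minimal RSS among all linear fits
```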

Why the other options are incorrect

Option A (Incorrect):
Adding more predictors does not guarantee better predictive accuracy. Although training error may decrease, test error can increase due to overfitting.

Option C (Incorrect):
Multicollinearity makes coefficient estimates unstable and difficult to interpret. It inflates the variance of the estimated regression coefficients.

Option D (Incorrect):
OLS estimates are biased if any regressor is correlated with the error term, which violates the exogeneity assumption.

9.
In multiple linear regression, what is the main purpose of the F-test? [Introductory Statistics / Machine Learning – Exam Style]






Correct Answer: C

Explanation:

A significant F-test indicates that the predictors jointly explain variation in the response variable, meaning at least one predictor is useful.

The F-test in multiple linear regression checks whether the regression model as a whole is useful. It tests whether at least one regression coefficient (excluding the intercept) is non-zero.


If the F-test is significant, it indicates that the predictors jointly help explain variation in the response variable.
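
For illustration, a short sketch using statsmodels (assuming it is installed; the data are synthetic) retrieves the overall F-statistic and its p-value from a fitted OLS model:

```python
# Sketch: the overall F-test reported by a fitted OLS model tests whether the
# predictors are jointly useful (all slopes zero vs. at least one nonzero).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 3))
y = 1.5 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=n)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.fvalue)    # F-statistic for the joint test
print(model.f_pvalue)  # small p-value => predictors jointly explain variation
```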

10.
Which of the following is a direct consequence of severe multicollinearity in a multiple linear regression model? [University of California, Berkeley – CS189 Introduction to Machine Learning]






Correct Answer: C

Multicollinearity occurs when two or more predictors in a multiple linear regression model are highly correlated.

In such cases:

  • The matrix XᵀX becomes nearly singular.
  • Its inverse (XᵀX)⁻¹ contains very large values.
  • As a result, the variance of the OLS coefficient estimates increases.

Formally, the covariance matrix of OLS estimates is:

Var(β̂) = σ²(XᵀX)⁻¹

When predictors are highly correlated, the entries of (XᵀX)⁻¹ become very large, leading to unstable coefficient estimates.
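
A small NumPy sketch (synthetic data) shows how a nearly duplicated predictor inflates the diagonal of σ²(XᵀX)⁻¹, i.e., the coefficient variances:

```python
# Sketch: with two nearly identical predictors, X^T X is close to singular and
# the diagonal of sigma^2 (X^T X)^{-1} (the coefficient variances) blows up.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2_indep = rng.normal(size=n)                  # independent predictor
x2_collinear = x1 + 0.01 * rng.normal(size=n)  # almost a copy of x1

def coef_variances(cols, sigma2=1.0):
    X = np.column_stack(cols)
    return sigma2 * np.diag(np.linalg.inv(X.T @ X))

print(coef_variances([x1, x2_indep]))      # small, stable variances
print(coef_variances([x1, x2_collinear]))  # huge variances for both coefficients
```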

Why the other options are incorrect

Option A (Incorrect):
Multicollinearity does not reduce standard errors; it increases them.

Option B (Incorrect):
High multicollinearity reduces statistical significance because inflated variances lead to larger p-values.

Option D (Incorrect):
Multicollinearity does not introduce bias into OLS estimates. OLS remains unbiased as long as regressors are uncorrelated with the error term.

Wednesday, January 14, 2026

Top Machine Learning MCQs with Detailed Answers | Perceptron, SVM, Clustering & Neural Networks


Introduction: Machine Learning MCQs — Perceptron, SVM, Clustering & Neural Networks

Welcome to a comprehensive Machine Learning MCQ practice hub designed for students, job aspirants, competitive exam takers, and interview candidates looking to strengthen their conceptual understanding and problem-solving skills in core ML topics. This curated set of Multiple Choice Questions (MCQs) focuses on foundational algorithms and models including Perceptron, Support Vector Machines (SVM), Clustering techniques, and Neural Networks — all of which form the backbone of modern Artificial Intelligence and Data Science.

Machine learning empowers computers to learn patterns and make predictions from data without explicit programming — a capability at the heart of applications like image recognition, natural language processing, and intelligent automation. Practicing MCQs helps you reinforce key ideas such as linear and non-linear classification boundaries, maximum margin optimization in SVMs, unsupervised grouping in clustering, and layered function approximation in neural networks — sharpening your exam readiness and coding intuition.

Whether you are preparing for semester exams, GATE, university viva-voce tests, technical interviews, or certification quizzes, these machine learning MCQs with detailed answers will guide your conceptual clarity, analytical thinking, and practical exam performance — making complex algorithms approachable and memorable.


1.
Consider a Perceptron that has two input units and one output unit, which uses an LTU activation function, plus a bias input of +1 and a bias weight w₃ = 1. If both inputs associated with an example are 0 and both weights, w₁ and w₂, connecting the input units to the output unit have value 1, and the desired (teacher) output value is 0, how will the weights change after applying the Perceptron Learning rule with learning rate parameter α = 1? [University of Wisconsin–Madison, CS540-2: Introduction to Artificial Intelligence, May 2018 - Final exam answers]







Correct Answer: D

What is a perceptron?

A Perceptron is the simplest type of artificial neural network and is used for binary classification problems. It works like a decision-making unit that takes multiple inputs, multiplies each input by a weight, adds a bias, and then produces an output.

Mathematically, the perceptron computes a weighted sum of inputs and passes it through an activation function. With an LTU (linear threshold unit) activation, the output is 1 if the weighted sum (including the bias term) is greater than or equal to 0, and 0 otherwise.

Perceptron Weight Update Using the Perceptron Learning Rule - Answer explained

Given:

  • Inputs: x₁ = 0, x₂ = 0
  • Bias input: x₃ = +1
  • Initial weights: w₁ = 1, w₂ = 1, w₃ = 1
  • Learning rate (α) = 1
  • Desired (teacher) output: t = 0
  • Activation function: Linear Threshold Unit (LTU)

Step 1: Net Input Calculation

net = w₁x₁ + w₂x₂ + w₃x₃
net = (1 × 0) + (1 × 0) + (1 × 1) = 1

Step 2: Actual Output

Since net ≥ 0, the LTU output is:
y = 1

Step 3: Error Calculation

error = t − y = 0 − 1 = −1

Step 4: Weight Update (Perceptron Learning Rule)

wᵢ(new) = wᵢ + α(t − y)xᵢ

Updated weights:

  • w₁(new) = 1 + (1)(−1)(0) = 1
  • w₂(new) = 1 + (1)(−1)(0) = 1
  • w₃(new) = 1 + (1)(−1)(1) = 0

Final Answer

After applying the Perceptron Learning Rule, the updated weights are:

  • w₁ = 1
  • w₂ = 1
  • w₃ = 0

Explanation: Since both input values are zero, the input weights remain unchanged. The perceptron incorrectly produced an output of 1, so the bias weight is reduced to lower the net input in future predictions.
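
The same update can be reproduced with a few lines of NumPy (a sketch of this specific example, not a general perceptron trainer):

```python
# Sketch of the perceptron update for this exact example (LTU activation,
# alpha = 1, inputs x1 = x2 = 0, bias input x3 = +1, target t = 0).
import numpy as np

x = np.array([0.0, 0.0, 1.0])      # [x1, x2, bias input]
w = np.array([1.0, 1.0, 1.0])      # [w1, w2, w3]
t, alpha = 0, 1.0

net = w @ x                        # weighted sum = 1
y = 1 if net >= 0 else 0           # LTU output = 1
w = w + alpha * (t - y) * x        # perceptron learning rule

print(w)                           # [1. 1. 0.]  -> only the bias weight changes
```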

2.
Consider a dataset containing six one-dimensional points: {2, 4, 7, 8, 12, 14}. After three iterations of Hierarchical Agglomerative Clustering using Euclidean distance between points, we get the 3 clusters: C1 = {2, 4}, C2 = {7, 8} and C3 = {12, 14}. What clusters are merged at the next iteration using Single Linkage? [University of Wisconsin–Madison, CS540: Introduction to Artificial Intelligence, October 2019 - Midterm exam answers]







Correct Answer: A

Merge Using Single Linkage in Hierarchical Clustering

In Single Linkage hierarchical clustering, the distance between two clusters is defined as the minimum distance between any pair of points, one from each cluster.

Given Clusters

  • C1 = {2, 4}
  • C2 = {7, 8}
  • C3 = {12, 14}

Inter-Cluster Distance Calculations

Distance between C1 and C2:

min{|2 − 7|, |2 − 8|, |4 − 7|, |4 − 8|} = min{5, 6, 3, 4} = 3

Distance between C2 and C3:

min{|7 − 12|, |7 − 14|, |8 − 12|, |8 − 14|} = min{5, 7, 4, 6} = 4

Distance between C1 and C3:

min{|2 − 12|, |2 − 14|, |4 − 12|, |4 − 14|} = min{10, 12, 8, 10} = 8

Conclusion

The smallest inter-cluster distance is d(C1, C2) = 3. Therefore, using Single Linkage, the clusters C1 and C2 are merged in the next iteration.

Resulting cluster: {2, 4, 7, 8}
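
The same calculation in a short Python sketch (just the distance computations from this question):

```python
# Sketch: single-linkage distance = minimum pairwise distance between clusters,
# computed for the three clusters in the question.
from itertools import product

clusters = {"C1": [2, 4], "C2": [7, 8], "C3": [12, 14]}

def single_linkage(a, b):
    return min(abs(p - q) for p, q in product(a, b))

for name_a, name_b in [("C1", "C2"), ("C2", "C3"), ("C1", "C3")]:
    print(name_a, name_b, single_linkage(clusters[name_a], clusters[name_b]))
# C1 C2 3  <- smallest distance, so C1 and C2 merge next
# C2 C3 4
# C1 C3 8
```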

3.
Which of the following are true of support vector machines? [University of California at Berkeley, CS189: Introduction to Machine Learning, Spring 2019 - Final exam answers]







Correct Answer: A

What does the hyperparameter C mean in SVM?

In a soft-margin Support Vector Machine, the hyperparameter C controls the trade-off between:

  • Maximizing the margin (simpler model)
  • Minimizing classification error on training data

Explanation of each option

Option A — TRUE
Increasing the hyperparameter C penalizes misclassified training points more heavily, forcing the SVM to fit the training data more accurately.
➜ Training error generally decreases.
Option B — FALSE
Hard-margin SVM allows no misclassification and corresponds to C → ∞, not C = 0.
➜ With C = 0, misclassification is not penalized.
Option C — FALSE
Increasing C makes the classifier fit the training data more strictly.
➜ Training error decreases, not increases.
Option D — FALSE
A large C forces the decision boundary to accommodate even outliers.
➜ Sensitivity to outliers increases, not decreases.

Final Answer: Only Option A is true.

Exam Tip: Think of C as the cost of misclassification. High C → low training error but high sensitivity to outliers.
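
A quick scikit-learn sketch (synthetic, noisy data; the C values are arbitrary) illustrating how training error typically behaves as C grows:

```python
# Sketch: raising C makes a soft-margin SVM fit the training set more tightly,
# so training error usually goes down (and sensitivity to noisy points goes up).
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.1, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, 1.0 - clf.score(X, y))   # training error generally shrinks as C grows
```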

4.
Which of the following might be valid reasons for preferring an SVM over a neural network? [Indian Institute of Technology Delhi, ELL784: Introduction to Machine Learning, 2017 - 18 - Exam answers]







Correct Answer: B
Kernel SVMs can implicitly operate in infinite-dimensional feature spaces via the kernel trick, while neural networks have finite-dimensional parameterizations.

Option (b):
An SVM can effectively map the data to an infinite-dimensional space; a neural net cannot.

The key idea here comes from the kernel trick. Kernel-based SVMs (such as those using the RBF kernel) implicitly operate in an infinite-dimensional Hilbert space.

  • This mapping is done implicitly, without explicitly computing features.
  • The number of learned parameters does not grow with the feature space.
  • The optimization problem remains convex, guaranteeing a global optimum.

In contrast, neural networks:

  • Operate in finite-dimensional parameter spaces (finite neurons and weights).
  • Do not truly optimize over an infinite-dimensional feature space.
  • Require explicit architectural growth to approximate higher complexity.

SVMs can work exactly in infinite-dimensional feature spaces via kernels, whereas neural networks can only approximate such mappings using finite architectures.

Why other options are INCORRECT?

  • Option (a) — Incorrect: Neural networks can also learn non-linear transformations through hidden layers and activation functions.
  • Option (c) — Incorrect: Unlike neural networks, SVMs solve a convex optimization problem and do not get stuck in local minima.
  • Option (d) — Incorrect: The implicit feature space created by SVM kernels is typically harder—not easier—to interpret than neural network representations.

5.
Suppose that you are training a neural network for classification, but you notice that the training loss is much lower than the validation loss. Which of the following is the most appropriate way to address this issue? [Stanford University, CS224N: Natural Language Processing with Deep Learning Winter 2018 - Midterm exam answers]






Correct Answer: C

What does "training loss is much lower than the validation loss" mean?

A large gap between training and validation loss is a strong indicator of overfitting, where the model has low bias but high variance.

When the training loss is much lower than the validation loss, it means:

  • The model is learning the training data too well, including noise and minor patterns.
  • It fails to generalize to unseen data (validation set).
  • In other words, the network performs well on seen data but poorly on new data.

Why this happens

  • The model is too complex (too many layers or neurons).
  • Insufficient regularization (e.g., low dropout, weak L2 penalty).
  • Limited training data to learn generalized patterns.
  • Training for too many epochs, allowing memorization of the training set.

Explanation: Why option C is correct?
A much lower training loss compared to validation loss indicates overfitting. Increasing the L2 regularization weight penalizes large model weights, discourages overly complex decision boundaries, and improves generalization to unseen data.

Why the other options are incorrect

  • Option A — Incorrect: Decreasing dropout reduces regularization and typically worsens overfitting.
  • Option B — Incorrect: Increasing hidden layer size increases model capacity, making overfitting more likely.
  • Option D — Incorrect: Adding more layers increases complexity and usually amplifies overfitting.

Note: When training loss ≪ validation loss, think regularization, simpler models, or more data.
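
As an illustration, a small scikit-learn sketch (hypothetical data and settings; `alpha` is the L2 penalty weight in MLPClassifier) lets you compare training and validation accuracy as the L2 penalty increases:

```python
# Sketch: on small, noisy data, a larger L2 penalty (alpha) typically narrows
# the gap between training and validation accuracy of an overfitting network.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

for alpha in (1e-5, 1e-2, 1.0):
    net = MLPClassifier(hidden_layer_sizes=(100,), alpha=alpha,
                        max_iter=2000, random_state=0).fit(X_tr, y_tr)
    print(alpha, net.score(X_tr, y_tr), net.score(X_val, y_val))
```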

6.
Traditionally, when we have a real-valued input attribute during decision-tree learning we consider a binary split according to whether the attribute is above or below some threshold. Pat suggests that instead we should just have a multiway split with one branch for each of the distinct values of the attribute. From the list below choose the single biggest problem with Pat’s suggestion: [Carnegie Mellon University, 10-701/15-781 Final, Fall 2003 - Final exam answers]






Correct Answer: C

How Decision Trees Handle Real-Valued Attributes

Traditional Approach (Binary Split)

For a real-valued attribute A, decision trees choose a threshold t and split the data as:

  • A ≤ t
  • A > t

This approach groups nearby values together, allowing the model to learn general patterns while keeping the decision tree simple and robust.

Pat’s Suggestion

Pat proposes using a multiway split, with one branch for each distinct value of the real-valued attribute.

If the attribute has many unique values (which is very common for real-valued data), this would create many branches—potentially one branch per training example.

What Goes Wrong?

1. Perfect Memorization of Training Data

  • Each training example can end up in its own branch
  • Leaf nodes become extremely “pure”
  • The decision tree effectively memorizes the training set

👉 This usually results in very high (sometimes perfect) training accuracy.

2. Very Poor Generalization

  • Test data often contains values not seen during training
  • Even very close numeric values are treated as completely different
  • The model cannot generalize across ranges of values

👉 This leads to poor performance on the test set.

Why Option (iii) Is the Biggest Problem

  • Option (i) Too computationally expensive ❌
    Multiway splits increase complexity, but learning is still feasible and this is not the main issue.
  • Option (ii) Bad on both training and test ❌
    Incorrect, because training performance is usually very good.
  • Option (iii) Good on training, bad on test ✅ (Correct)
    This is a classic case of overfitting, where the model learns noise and exact values instead of true patterns.
  • Option (iv) Good on test, bad on training ❌
    Highly unlikely for a decision tree with this much flexibility.

Final Conclusion

Pat’s approach causes severe overfitting:

  • Excellent training accuracy
  • Poor generalization to unseen data

Therefore, the correct answer is:

(iii) It would probably result in a decision tree that scores well on the training set but badly on a test set.

7.
For a neural network, which one of these structural assumptions is the one that most affects the trade-off between underfitting (i.e. a high bias model) and overfitting (i.e. a high variance model): [Carnegie Mellon University, 10-701/15-781 Final, Fall 2003 - Final exam answers]






Correct Answer: A

The question is about model capacity / complexity, which directly controls the bias–variance trade-off.

Key Concept: Bias–Variance Trade-off

High Bias (Underfitting)

  • Model is too simple
  • Cannot capture underlying patterns

High Variance (Overfitting)

  • Model is too complex
  • Fits noise in the training data

The main factor controlling this trade-off is how expressive the model is.

Evaluating Each Option

Option (i) The Number of Hidden Nodes (Correct)

  • It determines how many parameters the network has
  • It controls how complex a function the network can represent
  • Few hidden nodes → simple model → high bias (underfitting)
  • Many hidden nodes → complex model → high variance (overfitting)

Overall, this directly controls the bias–variance trade-off.

Option (ii) The Learning Rate ❌

Affects training speed and convergence stability but does not change the model’s capacity or bias–variance behavior.

Option (iii) The Initial Choice of Weights ❌

Influences which local minimum is reached, but not the network structure or overall model complexity.

Option (iv) The Use of a Constant-Term Unit Input (Bias Unit) ❌

Allows shifting of activation functions, but has only a minor effect compared to the number of hidden nodes.

8.
Select the true statements about k-means clustering. Assume no two sample points are equal. [University of California at Berkeley, CS189: Introduction to Machine Learning, Spring 2025 - Final exam answers]






Correct Answer: D

The question asks you to select the true statements about k-means clustering, specifically about Lloyd’s Algorithm, which is the standard algorithm used to solve k-means.

k-means is greedy, initialization-dependent, centroid-based, and increasing k never increases the optimal cost.

Answer explanation:

Key assumptions given:
  • No two sample points are equal (this avoids tie cases but does not change the main conclusions).
  • We are reasoning about the k-means objective (cost) function, usually the sum of squared distances from points to their assigned cluster centroids.

Statement D:

Increasing the number of clusters k can never increase the global minimum of the k-means cost function.

Why this is true:

  • When k increases, the algorithm has more freedom to place centroids closer to the data points.
  • The optimal cost for k + 1 clusters is never worse than for k clusters. In the worst case, we could reuse the same clustering as before.
  • Since the number of data points is greater than the number of clusters, each cluster can contain at least one point, so the objective function remains valid.

Formally:
J*(k + 1) ≤ J*(k)

In the extreme case: k = n (one cluster per point) → each point is its own centroid → cost = 0.

Note: This statement is about the global minimum, not the solution found by Lloyd’s algorithm in practice, which may get stuck in a local minimum.
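
A quick scikit-learn sketch (illustrative data; many random restarts are used to approximate the optimal cost) showing that the best inertia found is non-increasing in k and reaches zero at k = n:

```python
# Sketch: the best k-means cost (inertia) found with several restarts is
# non-increasing as k grows; with k = n it reaches zero. (Lloyd's algorithm only
# guarantees a local minimum, so many restarts approximate the global optimum.)
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

for k in (1, 2, 5, 10, 30):
    km = KMeans(n_clusters=k, n_init=20, random_state=0).fit(X)
    print(k, round(km.inertia_, 4))   # cost shrinks as k increases; ~0 at k = n
```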


Why other options are wrong?

  • Option A: Lloyd’s finds local minima, not global ❌
  • Option B: Average linkage ≠ k-means ❌
  • Option C: Initialization affects results ❌

9.
For very large training data sets, which of the following will usually have the lowest training time? [University of Pennsylvania, CIS 520: Machine Learning, 2019 - Final exam answers]






Correct Answer: C

“KNN has almost zero training cost because it does not learn a model; it only stores the data.”


Option-by-option explanation

  • ❌ Logistic Regression: Training involves iterative optimization (gradient descent or Newton methods) with a cost of about O(n·d) per iteration and many passes over the data, so it is not the fastest for very large datasets.
  • ❌ Neural Networks: Training requires multiple epochs of backpropagation and is computationally very expensive. Training time grows rapidly with data size, number of layers, and number of neurons, making this one of the slowest models to train.
  • ✅ K-Nearest Neighbors (KNN): KNN has essentially no training phase. Training consists of simply storing the dataset in memory, with no optimization and no model fitting, so the cost is just the time needed to store the data. This gives it the lowest training time, especially for very large datasets. ⚠️ Note: prediction time is expensive, but that is not what is asked here.
  • ❌ Random Forests: Training involves building many decision trees, each performing recursive splits. Training cost grows quickly with the number of trees, tree depth, and dataset size, so it is slow to train on very large datasets.

10.
In building a linear regression model for a particular data set, you observe the coefficient of one of the features having a relatively high negative value. This suggests that [Indian Institute of Technology Madras (IITM), Introduction to Machine Learning, Quiz answers]






Correct Answer: C

In linear regression, coefficient magnitude alone does not determine feature importance unless features are on comparable scales.

A high magnitude suggests that the feature is important. However, it may be the case that another feature is highly correlated with this feature and its coefficient also has a high magnitude with the opposite sign, in effect cancelling out the effect of the former. Thus, we cannot really remark on the importance of a feature just because its coefficient has a relatively large magnitude.


Why other options are wrong?

  • Option A: The magnitude of a coefficient alone is misleading. Without knowing the feature scaling, units, correlation with other features, regularization used, etc., you cannot conclude that the feature has a “strong effect”. ❌
  • Option B: A high-magnitude coefficient (even a negative one) indicates that the model is sensitive to that feature. Ignoring the feature based only on the sign or raw magnitude is unjustified. ❌

Frequently Asked Questions (Machine Learning)

What does it mean when training loss is much lower than validation loss?

When training loss is much lower than validation loss, it indicates that the model is overfitting. The model has learned the training data very well, including noise, but fails to generalize to unseen data. This usually occurs due to high model complexity or insufficient regularization.

Why does using a multiway split for real-valued attributes in decision trees cause problems?

Using a multiway split for real-valued attributes creates many branches, often one per unique value. This leads to overfitting, where the decision tree performs very well on training data but poorly on test data because the splits capture noise rather than general patterns.

Which machine learning algorithm has the lowest training time for very large datasets?

K-nearest neighbors (KNN) usually has the lowest training time because it does not learn an explicit model. Training simply involves storing the data, while most computation happens during prediction.

Does Lloyd’s algorithm for k-means clustering find the global minimum?

No, Lloyd’s algorithm does not guarantee finding the global minimum of the k-means objective function. It converges to a local minimum that depends on the initial choice of cluster centroids.

Does increasing the number of clusters (k) in k-means always reduce the cost function?

Increasing the number of clusters cannot increase the global minimum of the k-means cost function as long as the number of data points is greater than the number of clusters. The cost is non-increasing as k increases because additional clusters allow equal or better fitting of the data.

How should a large negative coefficient be interpreted in linear regression?

A large negative coefficient indicates that the feature is negatively correlated with the target variable. However, the magnitude alone does not determine feature importance unless features are on comparable scales. Additional information such as feature normalization is required.

How does increasing the number of hidden nodes affect bias and variance?

Increasing the number of hidden nodes generally reduces bias but increases variance. While a more complex model can better fit the training data, it also becomes more prone to overfitting.

Why is feature scaling important when interpreting linear model coefficients?

Feature scaling makes coefficients comparable across features in linear models. Without scaling, features measured in smaller units may appear more important due to larger coefficient values, even if their true effect on the target variable is small.

Tuesday, January 6, 2026

Choosing the Right Machine Learning Algorithm – Real-World MCQs with Answers


Choosing the Right Machine Learning Algorithm – Real-World MCQs

Selecting the correct machine learning algorithm is a critical step in solving real-world data science problems. The choice depends on factors such as data type, problem objective, labeled vs unlabeled data, and output nature.

In this quiz, you will explore scenario-based MCQs using real-life datasets from domains such as real estate, e-commerce, banking, healthcare, recommendation systems, and time-series forecasting. These questions are commonly asked in university exams, ML interviews, and competitive tests.

Topics covered include:

  • Regression vs Classification problems
  • Supervised vs Unsupervised learning
  • Clustering and Customer Segmentation
  • Recommendation Systems
  • Time-Series Forecasting
  • Dimensionality Reduction

Each question includes difficulty level, data type, and clear explanations to help you understand why a particular ML algorithm is the best choice.

House Price Prediction (Bangalore Real Estate)
1. A real-estate company wants to predict house prices in Bangalore using features such as area (sq.ft), number of bedrooms, location, and age of the building. The target value is continuous.






Correct Answer: B
Difficulty: Easy
Data Type: Labeled, Continuous Target

Linear Regression is ideal for predicting continuous numerical values.

Why not others? Logistic Regression is for classification, K-Means is unsupervised, and Apriori is for association rules.

Email Spam Detection (Gmail-like System)
2. An email service like Gmail wants to classify emails as Spam or Not Spam using word frequencies and sender information.






Correct Answer: D
Difficulty: Easy
Data Type: Labeled, Text Data

Naive Bayes works well for probabilistic text classification problems.

Why not others? K-Means is unsupervised and PCA is for dimensionality reduction.

Customer Segmentation for Amazon
3. Amazon wants to group customers based on purchase history, spending behavior, and browsing activity for marketing purposes.






Correct Answer: B
Difficulty: Medium
Data Type: Unlabeled, Numerical Features

K-Means clusters similar customers without requiring labeled data.

Why not others? Classification algorithms require predefined labels.

Credit Card Fraud Detection
4. A bank wants to detect fraudulent credit-card transactions where fraud cases are rare compared to normal transactions.






Correct Answer: B
Difficulty: Interview-level
Data Type: Labeled, Imbalanced Dataset

Random Forest handles non-linearity and class imbalance effectively.

Why not others? Linear Regression cannot model classification boundaries.

Movie Recommendation System (Netflix-Style)
5. Netflix wants to recommend movies based on users’ viewing history and ratings from similar users.






Correct Answer: B
Difficulty: Medium
Data Type: Labeled User–Item Interactions

Collaborative Filtering leverages similarities among users or items.

Why not others? Regression models do not capture preference similarity.

Predicting Customer Churn (Telecom Dataset)
6. A telecom company wants to predict whether a customer will churn based on usage patterns and complaint history.






Correct Answer: A
Difficulty: Easy
Data Type: Labeled, Binary Target

Logistic Regression is designed for binary classification problems.

Why not others? PCA reduces features but does not classify.

Handwritten Digit Recognition (MNIST Dataset)
7. A system must recognize handwritten digits (0–9) from the MNIST image dataset.






Correct Answer: C
Difficulty: Medium
Data Type: Labeled Image Data

CNNs learn spatial features crucial for image recognition.

Why not others? Traditional ML models cannot exploit image structure.

Product Demand Forecasting (Walmart Sales Data)
8. Walmart wants to forecast next month’s product sales using historical daily sales data.






Correct Answer: B
Difficulty: Medium
Data Type: Time-Dependent Numerical Data

ARIMA models temporal dependencies in sequential data.

Why not others? K-Means ignores time ordering.

Identifying Frequent Product Bundles (Market Basket Analysis)
9. A supermarket wants to identify products that are frequently purchased together.






Correct Answer: B
Difficulty: Easy
Data Type: Transactional Data

Apriori discovers association rules from transaction records.

Why not others? Classification models do not find item associations.

Reducing Features in a High-Dimensional Dataset
10. A dataset contains 1,000 features, and the goal is to reduce dimensionality before training a model.






Correct Answer: B
Difficulty: Easy
Data Type: High-Dimensional Numerical Data

PCA reduces features while preserving maximum variance.

Why not others? K-Means clusters data but does not reduce dimensions.
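
A minimal scikit-learn sketch (the 1,000-feature dataset here is random, purely for illustration) of reducing dimensionality with PCA:

```python
# Sketch: PCA projects the data onto the directions of maximum variance,
# reducing dimensionality before model training.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1000))          # 500 samples, 1,000 features

pca = PCA(n_components=50)                # keep 50 principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                         # (500, 50)
print(pca.explained_variance_ratio_.sum())     # fraction of variance retained
```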
