
Introduction

Linear regression is one of the most fundamental models in statistics and machine learning, forming the basis for many advanced methods used in data science and artificial intelligence. Despite its simplicity, a deep understanding of linear regression involves important theoretical concepts such as ordinary least squares (OLS), regularization techniques, multicollinearity, optimization methods, and statistical hypothesis testing.

In this article, we present a curated set of advanced multiple-choice questions (MCQs) on linear regression and its variants, inspired by exam-style questions from top universities such as UC Berkeley, Stanford, MIT, CMU, and ETH Zürich. These questions are designed to test not only computational knowledge, but also conceptual clarity and theoretical reasoning.

Each question is accompanied by a clear explanation of the correct answer, making this resource especially useful for machine learning students, data science learners, and candidates preparing for university exams, interviews, or competitive tests. Topics covered include ridge and lasso regression, feature duplication, multicollinearity, the F-test, optimization methods, and the relationship between linear regression and neural networks.

Whether you are revising core concepts or aiming to strengthen your theoretical foundations, these MCQs will help you identify common misconceptions and develop a deeper understanding of linear regression models.

1.
Ridge regression can be used to address the problem of: [ETH Zürich, Advanced Machine Learning, January 2021 - Final exam answers]






Correct Answer: B

Ridge regression introduces an L2 regularization term that penalizes large coefficients.

This helps reduce model variance and stabilizes coefficient estimates when predictors are highly correlated (multicollinearity).

As a result, ridge regression effectively controls overfitting in linear regression models without eliminating features.
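
As a quick illustration (a minimal NumPy/scikit-learn sketch, not part of the original exam; the synthetic data and the alpha value are arbitrary choices), the snippet below fits plain OLS and ridge on two nearly identical predictors. Ridge typically yields smaller, more stable coefficients:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)          # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=n)   # true signal depends on x1 only

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)      # often large, offsetting values
print("Ridge coefficients:", ridge.coef_)    # shrunk and far more stable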

2.
Compared with LASSO, ridge regression is more likely to end up with higher sparsity of coefficients. This statement is: [ETH Zürich, Advanced Machine Learning, January 2021 - Final exam answers]






Correct Answer: C

LASSO (L1 regularization) can force some coefficients to become exactly zero, resulting in higher sparsity and implicit feature selection.

In contrast, ridge regression (L2 regularization) only shrinks coefficients toward zero but rarely makes them exactly zero.

Therefore, ridge regression typically produces less sparse solutions compared to LASSO.
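
A minimal sketch of this difference, using synthetic data where only 3 of 20 features matter (the alpha values are arbitrary and not from the exam):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.5, 1.0]                # only 3 informative features
y = X @ true_w + rng.normal(scale=0.3, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Exact zeros in lasso coefficients:", np.sum(lasso.coef_ == 0))  # typically many
print("Exact zeros in ridge coefficients:", np.sum(ridge.coef_ == 0))  # typically none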

3.
Which of the following is an advantage of Linear Regression over k-Nearest Neighbors (k-NN)? [University of Toronto, Machine Learning and Data Mining, Fall 2018 - Midterm exam answers]






Correct Answer: B

Linear Regression learns a global linear function, making it faster, interpretable, and better at generalizing for linear patterns.

In contrast, k-NN is non-parametric, requires storing the full dataset, and can be slow and less interpretable.
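
For a concrete (hypothetical) comparison, the sketch below fits both models on a truly linear relationship and then asks for a prediction outside the training range; linear regression extrapolates the global trend, while k-NN cannot:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
X_train = rng.uniform(0, 10, size=(100, 1))
y_train = 2.0 * X_train.ravel() + 1.0 + rng.normal(scale=0.2, size=100)

lr = LinearRegression().fit(X_train, y_train)
knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)

X_new = np.array([[20.0]])                   # far outside the training range
print("True value:       ", 2.0 * 20.0 + 1.0)     # 41
print("Linear regression:", lr.predict(X_new))    # close to 41
print("k-NN:             ", knn.predict(X_new))   # stuck near the y-values seen at x ≈ 10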

4.
Which of the following statements about linear regression and regularization is TRUE? [University of California, Berkeley, CS 189/289A Introduction to Machine Learning, Spring 2025 - Midterm exam answers]






Correct Answer: A

Ridge regression has a quadratic, convex cost function, so Newton’s Method reaches the global minimum in a single step.

Ridge regression has the following cost function:

J(w) = ∥Xw − y∥² + λ∥w∥₂²

This cost function is quadratic and convex in w.

For a quadratic function:

  • Newton’s method reaches the global minimum in one step, regardless of the starting point.

What is Newton's method in this context?

Newton’s method is an optimization algorithm used to find the minimum of a cost function. In linear or ridge regression, we can use it to find the weights w that minimize the loss.

Newton’s method says: “Use both the slope (gradient) and the curvature (second derivative) of the cost function to jump directly toward the minimum.”

In simple terms: gradient descent walks downhill step by step, while Newton’s method jumps straight to the bottom when the surface is a simple (quadratic) bowl.

Why are the other options wrong?

  • Option B: Validation error may increase. Adding quadratic features increases model complexity and never increases (and usually lowers) the training error, but the model may overfit, so the validation cost can increase. The word "always" makes this option wrong.
  • Option C: Newton’s method requires second derivatives, and the lasso (L1) penalty is not differentiable at zero, so Newton’s method cannot be applied directly, let alone converge in one step.
  • Option D: Lasso uses an L1 penalty, not L2.

Note: In practice, ridge regression is usually solved analytically rather than iteratively, but Newton’s method converges in one step due to the quadratic objective.
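
The sketch below (plain NumPy, not from the exam; the data and λ are synthetic) checks this numerically: one Newton step on the ridge objective from an arbitrary starting point lands exactly on the closed-form ridge solution:

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
lam = 0.5

w = rng.normal(size=4)                             # arbitrary starting point
grad = 2 * X.T @ (X @ w - y) + 2 * lam * w         # gradient of J(w)
hess = 2 * X.T @ X + 2 * lam * np.eye(4)           # Hessian (constant, since J is quadratic)
w_newton = w - np.linalg.solve(hess, grad)         # a single Newton step

w_closed = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)  # analytic ridge solution
print(np.allclose(w_newton, w_closed))             # True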

5.
Which of the following statements best explains why linear regression can be considered a special case of a neural network? [Carnegie Mellon University, 10-701 Machine Learning, Fall 2018 - Midterm exam answers - Modified Question]






Correct Answer: D

Linear regression can be viewed as a neural network with a single output neuron, no hidden layers, and a linear (identity) activation function, making it a special case of neural networks.

Mathematical Equivalence of Linear Regression and Neural Network

Linear Regression:

y = wᵀx + b

Single-Neuron Neural Network (No Activation):

y = wᵀx + b

They are identical: the same equation, the same parameters, and the same predictions.

Hence, linear regression is a special (degenerate) case of a neural network.
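
To make this concrete, here is a minimal sketch (it assumes PyTorch is available; the data is synthetic): a network consisting of a single nn.Linear output neuron with no hidden layers and an identity activation, trained with squared-error loss, recovers essentially the same weights as the underlying linear model:

import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3)).astype(np.float32)
w_true = np.array([1.0, -2.0, 0.5], dtype=np.float32)
b_true = 0.3
y = X @ w_true + b_true + rng.normal(scale=0.05, size=200).astype(np.float32)

model = nn.Linear(3, 1)                      # one output neuron, no hidden layers, identity activation
opt = torch.optim.SGD(model.parameters(), lr=0.1)
X_t = torch.from_numpy(X)
y_t = torch.from_numpy(y).unsqueeze(1)

for _ in range(2000):                        # full-batch gradient descent on the squared-error loss
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X_t), y_t)
    loss.backward()
    opt.step()

print("Learned weights:", model.weight.data.numpy().ravel(), "bias:", model.bias.item())
print("True weights:   ", w_true, "bias:", b_true)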

6.
When selecting additional input points x to label for a linear regression model, why is it desirable to choose points that are as spread out as possible? [University of Pennsylvania, CIS520 Machine Learning, 2019 – Final Exam (Modified)]






Correct Answer: A

Points that are far from existing data have higher leverage and provide new geometric information about the regression line. Labeling such points improves the conditioning of XᵀX, thereby reducing variance in parameter estimates.

In simple terms, nearby points mostly repeat existing information, while distant points teach the model something new.

From a statistical perspective, the covariance of the least-squares estimator is σ²(XᵀX)⁻¹; spread-out points improve the conditioning of XᵀX, thereby reducing parameter variance.
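
A small NumPy sketch of this effect (the noise variance σ² and the point locations are illustrative assumptions): the same number of labeled points, clustered versus spread out, gives very different slope variances under σ²(XᵀX)⁻¹:

import numpy as np

sigma2 = 1.0                                  # assumed noise variance

def coef_covariance(x):
    # Design matrix with an intercept column, then sigma^2 * (X^T X)^{-1}
    X = np.column_stack([np.ones_like(x), x])
    return sigma2 * np.linalg.inv(X.T @ X)

x_clustered = np.linspace(4.9, 5.1, 20)       # 20 points bunched together
x_spread = np.linspace(0.0, 10.0, 20)         # 20 points spread out

print("Slope variance (clustered):", coef_covariance(x_clustered)[1, 1])  # large
print("Slope variance (spread):   ", coef_covariance(x_spread)[1, 1])     # small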

7.
Consider a linear regression model. Suppose one of the input features is duplicated (i.e., an identical copy of a feature is added to the design matrix). Which of the following statements are correct? [University of California at Berkeley, CS189 Introduction to Machine Learning, 2015 – Final Exam (Modified)]






Correct Answer: B

What does “duplicating a feature” mean?

Suppose your original linear regression model is:

y = w₁x₁ + ε

Now, suppose you duplicate the feature x1, meaning the same feature appears twice as identical columns in the design matrix. The model then becomes:

y = w₁x₁ + w₂x₁ + ε

This can be rewritten as:

y = (w₁ + w₂)x₁ + ε

Thus, the prediction capacity of the model remains unchanged. Duplicating a feature does not add new information; it merely allows the same effect to be distributed across multiple weights.

Explanation

Option A (Incorrect):
In ridge regression, the penalty term is λ ∑ wᵢ². When a feature is duplicated, its weight can be split across two identical features. This reduces the L2 penalty while keeping the model’s predictions unchanged.

Option B (Correct):
The Residual Sum of Squares (RSS) depends only on the model’s predictions. Duplicating a feature does not introduce any new information, so the minimum RSS remains unchanged.

Option C (Incorrect):
In lasso regression, the penalty term is λ ∑ |wᵢ|. Splitting a weight across duplicated features does not reduce the L1 penalty, since lasso encourages sparsity and prefers assigning the entire weight to a single feature.
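
The sketch below (scikit-learn, synthetic data, arbitrary alpha values) illustrates all three points: duplicating a column leaves the minimum RSS unchanged, lets ridge split the weight to lower its L2 penalty, and typically leaves lasso putting essentially all of the weight on one copy:

import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(5)
x1 = rng.normal(size=(100, 1))
y = 3.0 * x1.ravel() + rng.normal(scale=0.1, size=100)

X_orig = x1
X_dup = np.hstack([x1, x1])                   # identical copy of the feature

def rss(model, X):
    return np.sum((y - model.predict(X)) ** 2)

print("RSS, original feature:  ", rss(LinearRegression().fit(X_orig, y), X_orig))
print("RSS, duplicated feature:", rss(LinearRegression().fit(X_dup, y), X_dup))   # unchanged

print("Ridge coefficients:", Ridge(alpha=1.0).fit(X_dup, y).coef_)   # weight split roughly in half
print("Lasso coefficients:", Lasso(alpha=0.1).fit(X_dup, y).coef_)   # typically one coefficient near zero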

8.
In a multiple linear regression model y = β₀ + β₁x₁ + β₂x₂ + ··· + βₖxₖ + ε, which of the following statements is TRUE? [Stanford University, CS229 Machine Learning – Exam Style]






Correct Answer: B

What is Ordinary Least Squares (OLS)?

Ordinary Least Squares (OLS) is a method used in linear regression to estimate the model parameters (coefficients). It chooses the coefficients so that the sum of squared differences between the observed values and the predicted values is as small as possible.

Why Option B is correct

OLS estimates the regression coefficients by minimizing the sum of squared residuals, i.e.,

RSS = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

This is the defining property of OLS. The solution is obtained by solving a convex optimization problem, which guarantees that the coefficients chosen minimize the residual sum of squares.
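
As a small numerical check (a NumPy sketch with synthetic data), the coefficients from the normal equations give the smallest RSS; perturbing them can only increase it:

import numpy as np

rng = np.random.default_rng(6)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])   # intercept + 2 predictors
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=100)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)   # OLS via the normal equations

def rss(beta):
    return np.sum((y - X @ beta) ** 2)

print("RSS at the OLS solution:", rss(beta_ols))
print("RSS after perturbing it:", rss(beta_ols + 0.1 * rng.normal(size=3)))   # always larger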

Why the other options are incorrect

Option A (Incorrect):
Adding more predictors does not guarantee better predictive accuracy. Although training error may decrease, test error can increase due to overfitting.

Option C (Incorrect):
Multicollinearity makes coefficient estimates unstable and difficult to interpret. It inflates the variance of the estimated regression coefficients.

Option D (Incorrect):
OLS estimates are biased if any regressor is correlated with the error term, which violates the exogeneity assumption.

9.
In multiple linear regression, what is the main purpose of the F-test? [Introductory Statistics / Machine Learning – Exam Style]






Correct Answer: C

Explanation:

The F-test in multiple linear regression checks whether the regression model as a whole is useful: it tests the null hypothesis that all slope coefficients (excluding the intercept) are zero.

A significant F-test therefore indicates that the predictors jointly explain variation in the response variable, i.e., at least one predictor is useful.
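
A minimal sketch of the overall F-test (it assumes the statsmodels package; the data is synthetic, with only one truly useful predictor):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + rng.normal(size=100)       # only the first predictor matters

results = sm.OLS(y, sm.add_constant(X)).fit()  # add_constant supplies the intercept column

print("F-statistic:   ", results.fvalue)       # large when the predictors jointly explain y
print("F-test p-value:", results.f_pvalue)     # small => at least one slope is non-zero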

10.
Which of the following is a direct consequence of severe multicollinearity in a multiple linear regression model? [University of California, Berkeley – CS189 Introduction to Machine Learning]






Correct Answer: C

Multicollinearity occurs when two or more predictors in a multiple linear regression model are highly correlated.

In such cases:

  • The matrix XᵀX becomes nearly singular.
  • Its inverse (XᵀX)⁻¹ contains very large values.
  • As a result, the variance of the OLS coefficient estimates increases.

Formally, the covariance matrix of OLS estimates is:

Var(β̂) = σ²(XᵀX)⁻¹

When predictors are highly correlated, the entries of (XᵀX)⁻¹ become very large, leading to unstable coefficient estimates.
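
The sketch below (plain NumPy, synthetic data, σ² fixed at 1) shows the slope variance from σ²(XᵀX)⁻¹ growing as two predictors become more strongly correlated:

import numpy as np

rng = np.random.default_rng(8)
n, sigma2 = 200, 1.0
x1 = rng.normal(size=n)

for noise in [1.0, 0.1, 0.01]:                 # smaller noise => stronger collinearity
    x2 = x1 + noise * rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    var_beta = sigma2 * np.linalg.inv(X.T @ X)
    corr = np.corrcoef(x1, x2)[0, 1]
    print(f"corr(x1, x2) = {corr:.4f}   Var(beta_1) = {var_beta[1, 1]:.3f}")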

Why the other options are incorrect

Option A (Incorrect):
Multicollinearity does not reduce standard errors; it increases them.

Option B (Incorrect):
High multicollinearity reduces statistical significance because inflated variances lead to larger p-values.

Option D (Incorrect):
Multicollinearity does not introduce bias into OLS estimates. OLS remains unbiased as long as regressors are uncorrelated with the error term.