
Introduction

Linear regression is one of the most fundamental models in statistics and machine learning, forming the basis for many advanced methods used in data science and artificial intelligence. Despite its simplicity, a deep understanding of linear regression involves important theoretical concepts such as ordinary least squares (OLS), regularization techniques, multicollinearity, optimization methods, and statistical hypothesis testing.

In this article, we present a curated set of advanced multiple-choice questions (MCQs) on linear regression and its variants, inspired by exam-style questions from top universities such as UC Berkeley, Stanford, MIT, CMU, and ETH Zürich. These questions are designed to test not only computational knowledge, but also conceptual clarity and theoretical reasoning.

Each question is accompanied by a clear explanation of the correct answer, making this resource especially useful for machine learning students, data science learners, and candidates preparing for university exams, interviews, or competitive tests. Topics covered include ridge and lasso regression, feature duplication, multicollinearity, the F-test, optimization methods, and the relationship between linear regression and neural networks.

Whether you are revising core concepts or aiming to strengthen your theoretical foundations, these MCQs will help you identify common misconceptions and develop a deeper understanding of linear regression models.

1.
Ridge regression can be used to address the problem of: [ETH Zürich, Advanced Machine Learning, January 2021 - Final exam answers]






Correct Answer: B

Ridge regression introduces an L2 regularization term that penalizes large coefficients.

This helps reduce model variance and stabilizes coefficient estimates when predictors are highly correlated (multicollinearity).

As a result, ridge regression effectively controls overfitting in linear regression models without eliminating features.
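
As a quick illustration (a minimal NumPy/scikit-learn sketch, not part of the original exam; the synthetic data and the alpha value are arbitrary choices), the snippet below fits plain OLS and ridge on two nearly identical predictors. Ridge typically yields smaller, more stable coefficients:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)          # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=n)   # true signal depends on x1 only

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)      # often large, offsetting values
print("Ridge coefficients:", ridge.coef_)    # shrunk and far more stable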

2.
Compared with LASSO, ridge regression is more likely to end up with higher sparsity of coefficients. This statement is: [ETH Zürich, Advanced Machine Learning, January 2021 - Final exam answers]






Correct Answer: C

LASSO (L1 regularization) can force some coefficients to become exactly zero, resulting in higher sparsity and implicit feature selection.

In contrast, ridge regression (L2 regularization) only shrinks coefficients toward zero but rarely makes them exactly zero.

Therefore, ridge regression typically produces less sparse solutions compared to LASSO.
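
A minimal sketch of this difference, using synthetic data where only 3 of 20 features matter (the alpha values are arbitrary and not from the exam):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.5, 1.0]                # only 3 informative features
y = X @ true_w + rng.normal(scale=0.3, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Exact zeros in lasso coefficients:", np.sum(lasso.coef_ == 0))  # typically many
print("Exact zeros in ridge coefficients:", np.sum(ridge.coef_ == 0))  # typically none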

3.
Which of the following is an advantage of Linear Regression over k-Nearest Neighbors (k-NN)? [University of Toronto, Machine Learning and Data Mining, Fall 2018 - Midterm exam answers]






Correct Answer: B

Linear Regression learns a global linear function, making it faster, interpretable, and better at generalizing for linear patterns.

In contrast, k-NN is non-parametric, requires storing the full dataset, and can be slow and less interpretable.
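
For a concrete (hypothetical) comparison, the sketch below fits both models on a truly linear relationship and then asks for a prediction outside the training range; linear regression extrapolates the global trend, while k-NN cannot:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
X_train = rng.uniform(0, 10, size=(100, 1))
y_train = 2.0 * X_train.ravel() + 1.0 + rng.normal(scale=0.2, size=100)

lr = LinearRegression().fit(X_train, y_train)
knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)

X_new = np.array([[20.0]])                   # far outside the training range
print("True value:       ", 2.0 * 20.0 + 1.0)     # 41
print("Linear regression:", lr.predict(X_new))    # close to 41
print("k-NN:             ", knn.predict(X_new))   # stuck near the y-values seen at x ≈ 10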

4.
Which of the following statements about linear regression and regularization is TRUE? [University of California, Berkeley, CS 189/289A Introduction to Machine Learning, Spring 2025 - Midterm exam answers]






Correct Answer: A

Ridge regression has a quadratic, convex cost function, so Newton’s Method reaches the global minimum in a single step.

Ridge regression has the following cost function:

J(w) = ∥Xw − y∥² + λ∥w∥₂²

This cost function is quadratic and convex in w.

For a quadratic function:

  • Newton’s method reaches the global minimum in one step, regardless of the starting point.

What is Newton's method in this context?

Newton’s method is an optimization algorithm used to find the minimum of a cost function. In linear or ridge regression, we can use it to find the weights w that minimize the loss.

Newton’s method says: “Use both the slope (gradient) and the curvature (second derivative) of the cost function to jump directly toward the minimum.”

In simple terms: gradient descent walks downhill step by step, while Newton’s method jumps straight to the bottom when the surface is a simple (quadratic) bowl.

Why are the other options wrong?

  • Option B: Validation error may increase. Adding quadratic features increases model complexity and never increases (and usually lowers) the training error, but the model may overfit, so the validation cost can increase. The word "always" makes this option wrong.
  • Option C: Newton’s method requires second derivatives, and the lasso (L1) penalty is not differentiable at zero, so Newton’s method cannot be applied directly, let alone converge in one step.
  • Option D: Lasso uses an L1 penalty, not L2.

Note: In practice, ridge regression is usually solved analytically rather than iteratively, but Newton’s method converges in one step due to the quadratic objective.
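
The sketch below (plain NumPy, not from the exam; the data and λ are synthetic) checks this numerically: one Newton step on the ridge objective from an arbitrary starting point lands exactly on the closed-form ridge solution:

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
lam = 0.5

w = rng.normal(size=4)                             # arbitrary starting point
grad = 2 * X.T @ (X @ w - y) + 2 * lam * w         # gradient of J(w)
hess = 2 * X.T @ X + 2 * lam * np.eye(4)           # Hessian (constant, since J is quadratic)
w_newton = w - np.linalg.solve(hess, grad)         # a single Newton step

w_closed = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)  # analytic ridge solution
print(np.allclose(w_newton, w_closed))             # True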

5.
Which of the following statements best explains why linear regression can be considered a special case of a neural network? [Carnegie Mellon University, 10-701 Machine Learning, Fall 2018 - Midterm exam answers - Modified Question]






Correct Answer: D

Linear regression can be viewed as a neural network with a single output neuron, no hidden layers, and a linear (identity) activation function, making it a special case of neural networks.

Mathematical Equivalence of Linear Regression and Neural Network

Linear Regression:

y = wᵀx + b

Single-Neuron Neural Network (No Activation):

y = wᵀx + b

They are identical: the same equation, the same parameters, and the same predictions.

Hence, linear regression is a special (degenerate) case of a neural network.
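
To make this concrete, here is a minimal sketch (it assumes PyTorch is available; the data is synthetic): a network consisting of a single nn.Linear output neuron with no hidden layers and an identity activation, trained with squared-error loss, recovers essentially the same weights as the underlying linear model:

import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3)).astype(np.float32)
w_true = np.array([1.0, -2.0, 0.5], dtype=np.float32)
b_true = 0.3
y = X @ w_true + b_true + rng.normal(scale=0.05, size=200).astype(np.float32)

model = nn.Linear(3, 1)                      # one output neuron, no hidden layers, identity activation
opt = torch.optim.SGD(model.parameters(), lr=0.1)
X_t = torch.from_numpy(X)
y_t = torch.from_numpy(y).unsqueeze(1)

for _ in range(2000):                        # full-batch gradient descent on the squared-error loss
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X_t), y_t)
    loss.backward()
    opt.step()

print("Learned weights:", model.weight.data.numpy().ravel(), "bias:", model.bias.item())
print("True weights:   ", w_true, "bias:", b_true)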

6.
When selecting additional input points x to label for a linear regression model, why is it desirable to choose points that are as spread out as possible? [University of Pennsylvania, CIS520 Machine Learning, 2019 – Final Exam (Modified)]






Correct Answer: A

Points that are far from existing data have higher leverage and provide new geometric information about the regression line. Labeling such points improves the conditioning of XᵀX, thereby reducing variance in parameter estimates.

In simple terms, nearby points mostly repeat existing information, while distant points teach the model something new.

From a statistical perspective, the covariance of the least-squares estimator is σ²(XᵀX)⁻¹; spread-out points improve the conditioning of XᵀX, thereby reducing parameter variance.
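
A small NumPy sketch of this effect (the noise variance σ² and the point locations are illustrative assumptions): the same number of labeled points, clustered versus spread out, gives very different slope variances under σ²(XᵀX)⁻¹:

import numpy as np

sigma2 = 1.0                                  # assumed noise variance

def coef_covariance(x):
    # Design matrix with an intercept column, then sigma^2 * (X^T X)^{-1}
    X = np.column_stack([np.ones_like(x), x])
    return sigma2 * np.linalg.inv(X.T @ X)

x_clustered = np.linspace(4.9, 5.1, 20)       # 20 points bunched together
x_spread = np.linspace(0.0, 10.0, 20)         # 20 points spread out

print("Slope variance (clustered):", coef_covariance(x_clustered)[1, 1])  # large
print("Slope variance (spread):   ", coef_covariance(x_spread)[1, 1])     # small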

7.
Consider a linear regression model. Suppose one of the input features is duplicated (i.e., an identical copy of a feature is added to the design matrix). Which of the following statements are correct? [University of California at Berkeley, CS189 Introduction to Machine Learning, 2015 – Final Exam (Modified)]






Correct Answer: B

What does “duplicating a feature” mean?

Suppose your original linear regression model is:

y = w₁x₁ + ε

Now, suppose you duplicate the feature x1, meaning the same feature appears twice as identical columns in the design matrix. The model then becomes:

y = w₁x₁ + w₂x₁ + ε

This can be rewritten as:

y = (w₁ + w₂)x₁ + ε

Thus, the prediction capacity of the model remains unchanged. Duplicating a feature does not add new information; it merely allows the same effect to be distributed across multiple weights.

Explanation

Option A (Incorrect):
In ridge regression, the penalty term is λ ∑ wᵢ². When a feature is duplicated, its weight can be split across two identical features. This reduces the L2 penalty while keeping the model’s predictions unchanged.

Option B (Correct):
The Residual Sum of Squares (RSS) depends only on the model’s predictions. Duplicating a feature does not introduce any new information, so the minimum RSS remains unchanged.

Option C (Incorrect):
In lasso regression, the penalty term is λ ∑ |wᵢ|. Splitting a weight across duplicated features does not reduce the L1 penalty, since lasso encourages sparsity and prefers assigning the entire weight to a single feature.
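
The sketch below (scikit-learn, synthetic data, arbitrary alpha values) illustrates all three points: duplicating a column leaves the minimum RSS unchanged, lets ridge split the weight to lower its L2 penalty, and typically leaves lasso putting essentially all of the weight on one copy:

import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(5)
x1 = rng.normal(size=(100, 1))
y = 3.0 * x1.ravel() + rng.normal(scale=0.1, size=100)

X_orig = x1
X_dup = np.hstack([x1, x1])                   # identical copy of the feature

def rss(model, X):
    return np.sum((y - model.predict(X)) ** 2)

print("RSS, original feature:  ", rss(LinearRegression().fit(X_orig, y), X_orig))
print("RSS, duplicated feature:", rss(LinearRegression().fit(X_dup, y), X_dup))   # unchanged

print("Ridge coefficients:", Ridge(alpha=1.0).fit(X_dup, y).coef_)   # weight split roughly in half
print("Lasso coefficients:", Lasso(alpha=0.1).fit(X_dup, y).coef_)   # typically one coefficient near zero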

8.
In a multiple linear regression model y = β₀ + β₁x₁ + β₂x₂ + ··· + βₖxₖ + ε, which of the following statements is TRUE? [Stanford University, CS229 Machine Learning – Exam Style]






Correct Answer: B

What is Ordinary Least Squares (OLS)?

Ordinary Least Squares (OLS) is a method used in linear regression to estimate the model parameters (coefficients). It chooses the coefficients so that the sum of squared differences between the observed values and the predicted values is as small as possible.

Why Option B is correct

OLS estimates the regression coefficients by minimizing the sum of squared residuals, i.e.,

RSS = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

This is the defining property of OLS. The solution is obtained by solving a convex optimization problem, which guarantees that the coefficients chosen minimize the residual sum of squares.
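
As a small numerical check (a NumPy sketch with synthetic data), the coefficients from the normal equations give the smallest RSS; perturbing them can only increase it:

import numpy as np

rng = np.random.default_rng(6)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])   # intercept + 2 predictors
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=100)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)   # OLS via the normal equations

def rss(beta):
    return np.sum((y - X @ beta) ** 2)

print("RSS at the OLS solution:", rss(beta_ols))
print("RSS after perturbing it:", rss(beta_ols + 0.1 * rng.normal(size=3)))   # always larger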

Why the other options are incorrect

Option A (Incorrect):
Adding more predictors does not guarantee better predictive accuracy. Although training error may decrease, test error can increase due to overfitting.

Option C (Incorrect):
Multicollinearity makes coefficient estimates unstable and difficult to interpret. It inflates the variance of the estimated regression coefficients.

Option D (Incorrect):
OLS estimates are biased if any regressor is correlated with the error term, which violates the exogeneity assumption.

9.
In multiple linear regression, what is the main purpose of the F-test? [Introductory Statistics / Machine Learning – Exam Style]






Correct Answer: C

Explanation:

The F-test in multiple linear regression checks whether the regression model as a whole is useful: it tests the null hypothesis that all slope coefficients (excluding the intercept) are zero.

A significant F-test therefore indicates that the predictors jointly explain variation in the response variable, i.e., at least one predictor is useful.
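
A minimal sketch of the overall F-test (it assumes the statsmodels package; the data is synthetic, with only one truly useful predictor):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + rng.normal(size=100)       # only the first predictor matters

results = sm.OLS(y, sm.add_constant(X)).fit()  # add_constant supplies the intercept column

print("F-statistic:   ", results.fvalue)       # large when the predictors jointly explain y
print("F-test p-value:", results.f_pvalue)     # small => at least one slope is non-zero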

10.
Which of the following is a direct consequence of severe multicollinearity in a multiple linear regression model? [University of California, Berkeley – CS189 Introduction to Machine Learning]






Correct Answer: C

Multicollinearity occurs when two or more predictors in a multiple linear regression model are highly correlated.

In such cases:

  • The matrix XᵀX becomes nearly singular.
  • Its inverse (XᵀX)⁻¹ contains very large values.
  • As a result, the variance of the OLS coefficient estimates increases.

Formally, the covariance matrix of OLS estimates is:

Var(β̂) = σ²(XᵀX)⁻¹

When predictors are highly correlated, the entries of (XᵀX)⁻¹ become very large, leading to unstable coefficient estimates.
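
The sketch below (plain NumPy, synthetic data, σ² fixed at 1) shows the slope variance from σ²(XᵀX)⁻¹ growing as two predictors become more strongly correlated:

import numpy as np

rng = np.random.default_rng(8)
n, sigma2 = 200, 1.0
x1 = rng.normal(size=n)

for noise in [1.0, 0.1, 0.01]:                 # smaller noise => stronger collinearity
    x2 = x1 + noise * rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    var_beta = sigma2 * np.linalg.inv(X.T @ X)
    corr = np.corrcoef(x1, x2)[0, 1]
    print(f"corr(x1, x2) = {corr:.4f}   Var(beta_1) = {var_beta[1, 1]:.3f}")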

Why the other options are incorrect

Option A (Incorrect):
Multicollinearity does not reduce standard errors; it increases them.

Option B (Incorrect):
High multicollinearity reduces statistical significance because inflated variances lead to larger p-values.

Option D (Incorrect):
Multicollinearity does not introduce bias into OLS estimates. OLS remains unbiased as long as regressors are uncorrelated with the error term.