Introduction
Linear regression is one of the most fundamental models in statistics and machine learning, forming the basis for many advanced methods used in data science and artificial intelligence. Despite its simplicity, a deep understanding of linear regression involves important theoretical concepts such as ordinary least squares (OLS), regularization techniques, multicollinearity, optimization methods, and statistical hypothesis testing.
In this article, we present a curated set of advanced multiple-choice questions (MCQs) on linear regression and its variants, inspired by exam-style questions from top universities such as UC Berkeley, Stanford, MIT, CMU, and ETH Zürich. These questions are designed to test not only computational knowledge, but also conceptual clarity and theoretical reasoning.
Each question is accompanied by a clear explanation of the correct answer, making this resource especially useful for machine learning students, data science learners, and candidates preparing for university exams, interviews, or competitive tests. Topics covered include ridge and lasso regression, feature duplication, multicollinearity, the F-test, optimization methods, and the relationship between linear regression and neural networks.
Whether you are revising core concepts or aiming to strengthen your theoretical foundations, these MCQs will help you identify common misconceptions and develop a deeper understanding of linear regression models.
Ridge regression introduces an L2 regularization term that penalizes large coefficients.
This helps reduce model variance and stabilizes coefficient estimates when predictors are highly correlated (multicollinearity).
As a result, ridge regression effectively controls overfitting in linear regression models without eliminating features.
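To see this effect concretely, here is a minimal sketch (NumPy and scikit-learn assumed available; synthetic data and an illustrative α = 1) that fits OLS and ridge on two nearly identical predictors:

```python
# Sketch: OLS vs. ridge with two nearly identical (highly correlated) predictors.
# Synthetic data; alpha = 1.0 is an illustrative choice.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)      # almost a copy of x1 -> multicollinearity
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)       # often large, offsetting values
print("Ridge coefficients:", ridge.coef_)     # shrunk and split roughly evenly (~1.5 each)
```

Note that ridge keeps both features in the model; it stabilizes their coefficients rather than discarding either one.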
LASSO (L1 regularization) can force some coefficients to become exactly zero, resulting in higher sparsity and implicit feature selection.
In contrast, ridge regression (L2 regularization) only shrinks coefficients toward zero but rarely makes them exactly zero.
Therefore, ridge regression typically produces less sparse solutions compared to LASSO.
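A minimal sketch of this difference (scikit-learn assumed available; synthetic data where only 2 of 20 features matter, and the α values are illustrative):

```python
# Sketch: lasso produces exact zeros (sparsity); ridge only shrinks.
# Synthetic data with 20 features, of which only the first two are informative.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print("Zero coefficients (lasso):", int(np.sum(lasso.coef_ == 0.0)))  # most of the 18 noise features
print("Zero coefficients (ridge):", int(np.sum(ridge.coef_ == 0.0)))  # usually 0: small but nonzero
```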
Linear regression learns a single global linear function, making it fast, interpretable, and better at generalizing to linear patterns.
In contrast, k-NN is non-parametric, requires storing the full training set, and can be slow and less interpretable.
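For illustration, the following sketch (synthetic 1-D data, scikit-learn assumed available) contrasts how the two models behave outside the training range:

```python
# Sketch: linear regression extrapolates its global line; k-NN can only average
# stored neighbours, so it struggles outside the training range. Synthetic 1-D data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
X_train = rng.uniform(0, 10, size=(100, 1))
y_train = 2 * X_train.ravel() + 1 + rng.normal(scale=0.3, size=100)

lin = LinearRegression().fit(X_train, y_train)
knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)

X_new = np.array([[20.0]])                        # far outside the training range
print("True value:        ", 2 * 20 + 1)          # 41
print("Linear regression: ", lin.predict(X_new))  # close to 41
print("k-NN:              ", knn.predict(X_new))  # stays near ~20, the edge of the training data
```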
Ridge regression has a quadratic, convex cost function, so Newton’s Method reaches the global minimum in a single step.
Ridge regression has the following cost function:
J(w) = ∥Xw − y∥₂² + λ∥w∥₂²
This cost function is a quadratic and convex function in w.
For a quadratic function:
- Newton’s method reaches the global minimum in one step, regardless of the starting point.
What is Newton's method in this context?
Newton’s method is an optimization algorithm used to find the minimum of a cost function. In linear / ridge regression, we use it to find the best weights w that minimize the loss.
Newton’s method says: “Use both the slope (gradient) and the curvature (second derivative) of the cost function to jump directly toward the minimum.”
In simple terms: gradient descent is like walking downhill step by step, while Newton’s method jumps straight to the bottom when the surface is a simple (quadratic) bowl.
Why the other options are wrong
- Option B: Validation error may increase. Adding quadratic features increases model complexity and never increases training error, but the model may overfit, so the validation cost can go up. The word "always" makes this option wrong.
- Option C: Newton's method requires second derivatives, and the lasso penalty is not differentiable at zero. Newton’s method cannot be applied directly, and certainly not in one step.
- Option D: Lasso uses an L1 penalty, not L2.
Note: In practice, ridge regression is usually solved analytically rather than iteratively, but Newton’s method converges in one step due to the quadratic objective.
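The one-step claim is easy to verify numerically. The sketch below (synthetic data, λ = 1) takes a single Newton step from a random starting point and compares it with the closed-form ridge solution (XᵀX + λI)⁻¹Xᵀy:

```python
# Sketch: one Newton step on the ridge objective J(w) = ||Xw - y||^2 + lam * ||w||^2
# lands exactly on the closed-form solution, from any starting point. Synthetic data.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
lam = 1.0

w0 = rng.normal(size=3)                         # arbitrary starting point
grad = 2 * X.T @ (X @ w0 - y) + 2 * lam * w0    # gradient of J at w0
hess = 2 * X.T @ X + 2 * lam * np.eye(3)        # Hessian (constant, since J is quadratic)
w_newton = w0 - np.linalg.solve(hess, grad)     # one Newton step

w_closed = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)   # analytic ridge solution
print(np.allclose(w_newton, w_closed))          # True: the global minimum in one step
```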
Linear regression can be viewed as a neural network with a single output neuron, no hidden layers, and a linear (identity) activation function, making it a special case of neural networks.
Mathematical Equivalence of Linear Regression and Neural Network
Linear Regression:
y = wᵀx + b
Single-Neuron Neural Network (No Activation):
y = wᵀx + b
They are identical: the same equation, the same parameters, and the same predictions.
Hence, linear regression is a special (degenerate) case of a neural network.
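A tiny NumPy sketch of this equivalence (the weights, bias, and input are arbitrary illustrative numbers):

```python
# Sketch: a single neuron with identity activation is literally y = w^T x + b.
import numpy as np

w = np.array([0.5, -1.2, 2.0])   # weights (same role in both views)
b = 0.7                          # bias / intercept
x = np.array([1.0, 3.0, -2.0])   # one input example

def linear_regression(x, w, b):
    return w @ x + b

def single_neuron(x, w, b):
    z = w @ x + b                # weighted sum computed by the neuron
    return z                     # identity activation: output the sum unchanged

print(linear_regression(x, w, b))   # -6.4
print(single_neuron(x, w, b))       # -6.4: same equation, same prediction
```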
Points that are far from existing data have higher leverage and provide new geometric information about the regression line. Labeling such points improves the conditioning of XᵀX, thereby reducing variance in parameter estimates.
In simple terms, nearby points mostly repeat existing information, while distant points teach the model something new.
From a statistical perspective, the covariance of the least-squares estimator is σ²(XᵀX)⁻¹; spread-out points improve the conditioning of XᵀX, thereby reducing parameter variance.
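A small numerical sketch of this variance argument (illustrative x-values, σ² = 1): with the same number of points, spreading them out shrinks the slope entry of σ²(XᵀX)⁻¹.

```python
# Sketch: slope variance sigma^2 * [(X^T X)^-1]_{1,1} for clustered vs. spread-out x.
import numpy as np

sigma2 = 1.0

def slope_variance(xs):
    X = np.column_stack([np.ones_like(xs), xs])     # intercept column + one feature
    return sigma2 * np.linalg.inv(X.T @ X)[1, 1]    # variance of the slope estimate

x_clustered = np.array([4.9, 5.0, 5.0, 5.1, 5.2])   # new points near existing data
x_spread    = np.array([0.0, 2.5, 5.0, 7.5, 10.0])  # high-leverage, spread-out points

print("Clustered x: ", slope_variance(x_clustered)) # large variance (~19)
print("Spread-out x:", slope_variance(x_spread))    # much smaller variance (~0.016)
```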
What does “duplicating a feature” mean?
Suppose your original linear regression model is:
y = w₁x₁ + ε
Now, suppose you duplicate the feature x1, meaning the same feature appears twice as identical columns in the design matrix. The model then becomes:
y = w₁x₁ + w₂x₁ + ε
This can be rewritten as:
y = (w₁ + w₂)x₁ + ε
Thus, the prediction capacity of the model remains unchanged. Duplicating a feature does not add new information; it merely allows the same effect to be distributed across multiple weights.
Explanation
Option A (Incorrect):
In ridge regression, the penalty term is
λ ∑ wᵢ².
When a feature is duplicated, its weight can be split across two identical features.
This reduces the L2 penalty while keeping the model’s predictions unchanged.
Option B (Correct):
The Residual Sum of Squares (RSS) depends only on the model’s predictions.
Duplicating a feature does not introduce any new information, so the minimum RSS
remains unchanged.
Option C (Incorrect):
In lasso regression, the penalty term is
λ ∑ |wᵢ|.
Splitting a weight across duplicated features does not reduce the L1 penalty,
since lasso encourages sparsity and prefers assigning the entire weight to a single feature.
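The three claims above can be checked with a few lines of NumPy (synthetic data; the weight value 4.0 is arbitrary):

```python
# Sketch: duplicating x1 leaves the minimum RSS unchanged, halves the ridge (L2)
# penalty when the weight is split, and leaves the lasso (L1) penalty unchanged.
import numpy as np

rng = np.random.default_rng(4)
x1 = rng.normal(size=50)
y = 3 * x1 + rng.normal(scale=0.5, size=50)

X_single = x1.reshape(-1, 1)
X_dup = np.column_stack([x1, x1])                    # the same feature twice

rss_single = float(np.linalg.lstsq(X_single, y, rcond=None)[1][0])
w_dup = np.linalg.lstsq(X_dup, y, rcond=None)[0]     # minimum-norm solution splits the weight
rss_dup = float(np.sum((X_dup @ w_dup - y) ** 2))
print("Minimum RSS, single feature:    ", rss_single)
print("Minimum RSS, duplicated feature:", rss_dup)   # identical

w = 4.0
print("L2 penalty, single weight:", w**2)                  # 16.0
print("L2 penalty, split weight: ", (w/2)**2 + (w/2)**2)   # 8.0  -> ridge penalty shrinks
print("L1 penalty, single weight:", abs(w))                # 4.0
print("L1 penalty, split weight: ", abs(w/2) + abs(w/2))   # 4.0  -> lasso penalty unchanged
```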
What is Ordinary Least Squares (OLS)?
Ordinary Least Squares (OLS) is a method used in linear regression to estimate the model parameters (coefficients). It chooses the coefficients so that the sum of squared differences between the observed values and the predicted values is as small as possible.
Why Option B is correct
OLS estimates the regression coefficients by minimizing the sum of squared residuals, i.e.,
RSS = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
This is the defining property of OLS. The solution is obtained by solving a convex optimization problem, which guarantees that the coefficients chosen minimize the residual sum of squares.
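A short sketch of this defining property (synthetic data, NumPy only): the coefficients from the normal equations achieve a smaller RSS than any perturbed alternative.

```python
# Sketch: the OLS coefficients from the normal equations minimize the RSS;
# any perturbation of them gives a strictly larger RSS. Synthetic data.
import numpy as np

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(80), rng.normal(size=(80, 2))])   # intercept + 2 predictors
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.4, size=80)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)                   # normal equations

def rss(beta):
    return np.sum((y - X @ beta) ** 2)

print("RSS at the OLS solution:", rss(beta_ols))
print("RSS at perturbed betas: ", rss(beta_ols + 0.05))        # always larger
```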
Why the other options are incorrect
Option A (Incorrect):
Adding more predictors does not guarantee better predictive accuracy.
Although training error may decrease, test error can increase due to
overfitting.
Option C (Incorrect):
Multicollinearity makes coefficient estimates unstable and difficult to
interpret. It inflates the variance of the estimated regression coefficients.
Option D (Incorrect):
OLS estimates are biased if any regressor is correlated with the
error term, which violates the exogeneity assumption.
Explanation:
A significant F-test indicates that the predictors jointly explain variation in the response variable, meaning at least one predictor is useful.
The F-test in multiple linear regression checks whether the regression model as a whole is useful. It tests whether at least one regression coefficient (excluding the intercept) is non-zero.
If the F-test is significant, it indicates that the predictors jointly help explain variation in the response variable.
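As an illustration, the sketch below (synthetic data, statsmodels assumed available) fits a two-predictor model where only the first predictor matters and reads off the overall F-statistic and its p-value:

```python
# Sketch: the overall F-test jointly tests H0: all slope coefficients are zero.
# Synthetic data where only the first of two predictors is actually useful.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 2))
y = 1.5 * X[:, 0] + rng.normal(scale=1.0, size=100)

model = sm.OLS(y, sm.add_constant(X)).fit()
print("F-statistic:", model.fvalue)     # large
print("p-value:    ", model.f_pvalue)   # tiny -> at least one slope is non-zero
```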
Multicollinearity occurs when two or more predictors in a multiple linear regression model are highly correlated.
In such cases:
- The matrix XᵀX becomes nearly singular.
- Its inverse (XᵀX)⁻¹ contains very large values.
- As a result, the variance of the OLS coefficient estimates increases.
Formally, the covariance matrix of OLS estimates is:
Var(β̂) = σ²(XᵀX)⁻¹
When predictors are highly correlated, (XᵀX)⁻¹ explodes, leading to unstable coefficient estimates.
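A quick numerical sketch of this formula (illustrative correlation levels, NumPy only): as two predictors become more correlated, the diagonal entries of (XᵀX)⁻¹ grow rapidly.

```python
# Sketch: the diagonal of (X^T X)^{-1} (and hence coefficient variance) grows
# rapidly as two predictors become more correlated.
import numpy as np

rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(size=n)

for rho in [0.0, 0.9, 0.999]:
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)   # correlation ~ rho with x1
    X = np.column_stack([x1, x2])
    diag = np.diag(np.linalg.inv(X.T @ X))
    print(f"rho = {rho:<5}: diag of (X^T X)^-1 =", np.round(diag, 5))
```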
Why the other options are incorrect
Option A (Incorrect):
Multicollinearity does not reduce standard errors; it
increases them.
Option B (Incorrect):
High multicollinearity reduces statistical significance because inflated
variances lead to larger p-values.
Option D (Incorrect):
Multicollinearity does not introduce bias into OLS estimates.
OLS remains unbiased as long as regressors are uncorrelated with the error term.