Model Validation in Machine Learning – 10 HOT MCQs with Answers | Cross-Validation, Hold-Out & Nested CV Explained
A. Accuracy is still valid
B. Accuracy may be optimistically biased
C. Folds were too small
D. It prevents data leakage
When data preprocessing—such as scaling, normalization, or feature selection—is applied before splitting (i.e., fitted on the entire dataset before it is divided into folds), information from the validation/test set can inadvertently leak into the training process. This leakage inflates the measured performance, so a result like the reported 95% accuracy is higher than what the model would achieve on truly unseen data. This is a well-known pitfall in cross-validation and machine learning validation.
Correct procedure for data preprocessing in cross-validation
Proper practice is to split the data first, then fit the preprocessing separately within each fold so the results are not biased (a code sketch follows the steps below).
For each fold:
- Split → Training and Validation subsets
- Fit preprocessing only on training data
- Transform both training and validation sets
- Train model
- Evaluate
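A minimal sketch of this procedure, assuming scikit-learn and its built-in breast-cancer dataset (neither is named in the quiz): wrapping the scaler and the classifier in a Pipeline guarantees that the preprocessing is refitted on the training portion of every fold, so nothing from the validation folds leaks in.

```python
# Sketch: preprocessing kept inside each CV fold via a scikit-learn Pipeline.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The pipeline is refitted from scratch on the training part of each fold,
# so scaler statistics never see the validation part.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"Mean CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```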
A. Nested cross-validation
B. Random train/test split without stratification
C. Cross-validation on dataset used for feature selection
D. Stratified k-fold
Running cross-validation on the same dataset that was already used for feature selection causes data leakage: the chosen features have already "seen" the validation folds, which makes accuracy look higher than it truly is, so the performance is overestimated (see the sketch below).
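A hedged illustration on synthetic data (not from the article): with pure-noise features and random labels the true accuracy is about 50%, yet selecting features before cross-validation reports a much higher score, while doing the selection inside the CV pipeline does not.

```python
# Sketch: feature selection outside vs. inside cross-validation.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))    # pure noise features
y = rng.integers(0, 2, size=100)    # random labels -> true accuracy ~50%

# Wrong: select features using ALL rows, then cross-validate on the survivors.
leaky_mask = SelectKBest(f_classif, k=20).fit(X, y).get_support()
leaky = cross_val_score(LogisticRegression(max_iter=1000), X[:, leaky_mask], y, cv=5)

# Right: selection happens inside each training fold via a pipeline.
honest = cross_val_score(
    make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)),
    X, y, cv=5,
)

print(f"Leaky CV accuracy:  {leaky.mean():.2f}")   # typically optimistically high
print(f"Honest CV accuracy: {honest.mean():.2f}")  # close to chance (~0.50)
```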
A. CV average
B. Retrain on full data and test on held-out test set
C. Best fold score
D. Validation score after tuning
A. Too little training data
B. Needs resampling
C. Fold too large
D. Almost all data used for training
A. Independent samples
B. Predicting future from past
C. Imbalanced data
D. Faster training
Explanation:
Time Series Cross-Validation (TSCV) is used when data points are ordered over time — for example, stock prices, weather data, or sensor readings.
- The order of data matters.
- Future values depend on past patterns.
- You must not shuffle the data, or it will leak future information.
Unlike standard k-fold cross-validation, TSCV respects the chronological order and ensures that the model is trained only on past data and evaluated on future data, mimicking real-world forecasting scenarios.
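A brief sketch, assuming scikit-learn's TimeSeriesSplit and a made-up series of 20 time-ordered points (the quiz itself names no library or data): each fold trains on an expanding window of past observations and validates on the points that immediately follow, with no shuffling.

```python
# Sketch: time-series CV with expanding training windows.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)   # stand-in for time-ordered observations
tscv = TimeSeriesSplit(n_splits=4)

# Train on the past, validate on the future; chronological order is preserved.
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    print(f"Fold {fold}: train={train_idx.min()}..{train_idx.max()}, "
          f"test={test_idx.min()}..{test_idx.max()}")
```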
A. Bootstrapping
B. Leave-p-out
C. Monte Carlo Cross-Validation
D. Nested CV
Explanation: Monte Carlo validation averages performance over multiple random splits.
Monte Carlo Cross-Validation (also known as Repeated Random Subsampling Validation) involves randomly splitting the dataset into training and testing subsets multiple times (e.g., 80% training and 20% testing).
The model is trained and evaluated on these splits repeatedly, and the results (such as accuracy) are averaged to estimate the model's performance.
This differs from k-fold cross-validation because the splits are random and may overlap — some data points might appear in multiple test sets or not appear at all in some iterations (a short code sketch follows the list below).
When is Monte Carlo Cross-Validation useful?
- You have limited data but want a more reliable performance estimate.
- You want flexibility in training/test split sizes.
- The dataset is large, and full k-fold CV is too slow.
- You don’t need deterministic folds.
- The data are independent and identically distributed (i.i.d.).
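A minimal sketch, assuming scikit-learn: ShuffleSplit implements repeated random subsampling (Monte Carlo CV), here with an 80/20 split repeated 10 times on the built-in iris dataset, which the quiz does not itself use.

```python
# Sketch: Monte Carlo (repeated random subsampling) cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

mc_cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=mc_cv)

# The estimate is the average over the random splits; individual points may
# fall into several test sets or into none of them.
print(f"Monte Carlo CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```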
A. Too many folds
B. Overfitting during tuning
C. Underfitted model
D. Large test set
A. Single 80/20 split
B. Nested CV
C. Stratified 10-fold
D. Leave-One-Out
How does Nested CV handle optimistic bias?
- Inner loop: Used exclusively to tune the model's hyperparameters by cross-validation on the training data.
- Outer loop: Used to estimate the generalization performance of the model with the tuned hyperparameters on a held-out test fold that was never seen during the inner tuning (see the sketch below).
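A compact sketch of this two-loop structure, assuming scikit-learn (the quiz names no library): GridSearchCV forms the inner tuning loop and cross_val_score the outer evaluation loop, so the reported score never comes from a fold that was used for tuning.

```python
# Sketch: nested cross-validation with an inner tuning loop and an outer evaluation loop.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Inner loop: pick C and gamma by cross-validation on each outer training fold.
tuned_svc = GridSearchCV(
    SVC(), param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=inner_cv
)

# Outer loop: score the tuned model on folds it never saw during tuning.
nested_scores = cross_val_score(tuned_svc, X, y, cv=outer_cv)
print(f"Nested CV accuracy: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
```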
When to use Nested Cross-Validation?
Use it whenever hyperparameters are tuned and you still need an unbiased estimate of how the tuned model will generalize, especially when the dataset is too small to afford a separate held-out test set.
A. Ensures higher accuracy
B. Eliminates overfitting
C. Uses full dataset efficiently
D. Requires less computation
A. Improve training accuracy
B. Reduce dataset size
C. Reduce training time
D. Measure generalization to unseen data
Explanation: Validation estimates generalization performance before final testing.
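A small hold-out sketch, assuming scikit-learn and the iris dataset (neither appears in the quiz): a validation set is carved out of the training data to estimate generalization during model selection, while the test set is kept untouched for the final report.

```python
# Sketch: train / validation / test hold-out split.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First reserve a final test set, then split a validation set off the remainder.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=42, stratify=y_train_full
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Validation accuracy (model selection): {model.score(X_val, y_val):.3f}")
print(f"Test accuracy (final report):          {model.score(X_test, y_test):.3f}")
```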
