Machine Learning Training Phase MCQs with Answers [2025 Updated]

Top 10 MCQs on Training of Machine Learning Models with Answers | Gradient Descent & Optimization Explained

1. Loss Function Purpose

In supervised training, what is the primary role of the loss function?

A. To measure model speed
B. To measure how far predictions deviate from true labels
C. To determine the optimal learning rate
D. To normalize feature values

Answer: B

Explanation: The loss function quantifies prediction error, and its value guides weight adjustments during training. Without it, the model has no measure of how well it is performing and no direction in which to improve.

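Why is the loss function crucial?

It gives feedback to the model about how wrong its predictions are.
It shapes the optimization landscape that the optimizer moves through.
It influences the bias/variance tradeoff, for example when a regularization term is added to it.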
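
To make the idea of "measuring deviation from true labels" concrete, here is a minimal sketch of one common loss, mean squared error. The function name `mse_loss` and the sample numbers are illustrative assumptions, not part of the quiz.

```python
def mse_loss(y_true, y_pred):
    """Mean squared error between two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative labels and two hypothetical sets of predictions.
y_true = [1.0, 2.0, 3.0]
good_preds = [1.1, 1.9, 3.2]   # close to the labels
bad_preds = [3.0, 0.0, 5.0]    # far from the labels

good_loss = mse_loss(y_true, good_preds)
bad_loss = mse_loss(y_true, bad_preds)
print("loss for close predictions:", good_loss)
print("loss for distant predictions:", bad_loss)
```

Predictions closer to the labels give a smaller loss, which is the feedback signal the optimizer works with.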

2. Gradient Calculation

In gradient-based optimization, the gradient of the loss function represents:

A. The direction of the steepest descent
B. The direction of the steepest ascent
C. The curvature of the loss surface
D. The absolute value of the error

Answer: B

Explanation: The gradient points toward the steepest increase in loss; we move in the opposite direction to minimize it.

What does the gradient tell us?

When we train a model using gradient-based optimization (like gradient descent), we want to minimize the loss function — that is, make the model’s error as small as possible.

To do that, we need to know how the loss changes with respect to the model’s parameters (weights).

That’s exactly what the gradient tells us.

Why do we move against the gradient?

The gradient itself points in the direction of maximum increase of the function (here, the loss). In gradient descent we want to minimize the loss, so we move in the opposite direction of the gradient.

That’s why the update rule in gradient descent is:

$w_{new} = w_{old} - \eta \times \nabla L(w)$
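
The update rule can be tried on a toy problem. The sketch below assumes a one-dimensional quadratic loss $L(w) = (w - 3)^2$, chosen only for illustration; the function names, starting point, and learning rate are likewise assumptions.

```python
def loss(w):
    # Illustrative loss with its minimum at w = 3.
    return (w - 3.0) ** 2


def grad(w):
    # Analytic derivative of the loss above.
    return 2.0 * (w - 3.0)


def gradient_descent(w0, eta, steps):
    """Repeatedly apply w_new = w_old - eta * grad(w_old)."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w


w_start = 0.0
w_final = gradient_descent(w0=w_start, eta=0.1, steps=100)
print("start loss:", loss(w_start))
print("final w:", w_final, "final loss:", loss(w_final))
```

Because each step moves against the gradient, the weight drifts toward the minimum at $w = 3$ and the loss shrinks.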

3. Backpropagation Core Idea

What is the main purpose of backpropagation in neural network training?

A. To store intermediate outputs
B. To propagate input forward
C. To compute gradients of weights using the chain rule
D. To normalize activations

Answer: C

Explanation: Backpropagation efficiently calculates partial derivatives of the loss with respect to each weight via the chain rule.

Backpropagation (backward propagation of errors) is the algorithm that computes how much each weight contributed to the error (loss) between predicted and true outputs. An optimizer such as gradient descent then uses these gradients to adjust the weights.

It’s how the network learns from its mistakes.
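
The chain rule at the heart of backpropagation can be shown on the smallest possible network. This is a sketch under stated assumptions: a single sigmoid neuron with a squared-error loss, illustrative input values, and a finite-difference check added only to confirm the hand-derived gradient.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, b, x, y):
    """Forward pass: z = w*x + b, a = sigmoid(z), loss = (a - y)^2."""
    z = w * x + b
    a = sigmoid(z)
    return z, a, (a - y) ** 2


def backward(w, b, x, y):
    """Backward pass: multiply local derivatives from the loss back to w and b."""
    _, a, _ = forward(w, b, x, y)
    dL_da = 2.0 * (a - y)       # derivative of the loss w.r.t. the activation
    da_dz = a * (1.0 - a)       # derivative of sigmoid w.r.t. its input
    dL_dz = dL_da * da_dz       # chain rule
    return dL_dz * x, dL_dz     # dL/dw, dL/db


def numerical_grad_w(w, b, x, y, eps=1e-6):
    """Central finite difference, used only as a sanity check."""
    plus = forward(w + eps, b, x, y)[2]
    minus = forward(w - eps, b, x, y)[2]
    return (plus - minus) / (2.0 * eps)


w, b, x, y = 0.5, -0.2, 1.5, 1.0   # illustrative values
dL_dw, dL_db = backward(w, b, x, y)
numeric = numerical_grad_w(w, b, x, y)
print("chain-rule gradient:", dL_dw)
print("finite-difference gradient:", numeric)
```

In a real multi-layer network the same multiplication of local derivatives is repeated layer by layer, reusing intermediate results instead of recomputing them.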

4. Mini-Batch Training Advantage

Why is mini-batch gradient descent often preferred over batch or stochastic gradient descent?

A. It eliminates gradient noise completely
B. It balances computational efficiency with gradient stability
C. It always converges faster than batch descent
D. It uses no randomness

Answer: B

Explanation: Mini-batches provide more stable updates than stochastic GD and require less computation per update than full-batch GD.

What is mini-batch gradient descent?

Mini-batch gradient descent is a variant of gradient descent where the training dataset is divided into small batches (subsets) of data. The model updates its weights after processing each mini-batch, rather than after every single example or after the entire dataset.

Mini-batch gradient descent is often chosen over SGD or batch gradient descent because it offers faster training, more stable convergence, better memory efficiency, and good use of GPU parallelism.
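
The "update after each mini-batch" loop described above might look like the following sketch. It assumes a one-parameter linear model $y = w x$ trained with mean squared error on noiseless toy data; the helper names `iterate_minibatches` and `train_linear` are hypothetical.

```python
import random


def iterate_minibatches(data, batch_size, seed=0):
    """Yield shuffled mini-batches; the last one may be smaller."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]


def train_linear(data, batch_size, eta=0.05, epochs=200):
    """Fit y = w * x, updating w once per mini-batch."""
    w = 0.0
    for epoch in range(epochs):
        for batch in iterate_minibatches(data, batch_size, seed=epoch):
            # Gradient of the batch-averaged squared error w.r.t. w.
            grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= eta * grad
    return w


# Toy data generated from y = 2x (illustrative).
data = [(x, 2.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]]
w_fit = train_linear(data, batch_size=2)
batch_sizes = [len(b) for b in iterate_minibatches(data, batch_size=4)]
print("fitted w:", w_fit)
print("batch sizes with batch_size=4:", batch_sizes)
```

Setting `batch_size=1` would turn this loop into stochastic gradient descent, and `batch_size=len(data)` into full-batch gradient descent.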

5. Weight Update Rule

In standard gradient descent, how are model weights updated?

A. $w_{new} = w_{old} + \eta \times \nabla L(w)$
B. $w_{new} = w_{old} - \eta \times \nabla L(w)$
C. $w_{new} = w_{old} \times \nabla L(w)$
D. $w_{new} = \eta \times w_{old}$

Answer: B

Explanation: We subtract the gradient scaled by the learning rate to move toward lower loss.

When training a model, the goal is to minimize the loss function $L(w)$, which measures how far the model’s predictions are from the true outputs.

The weights $w$ of the model determine its predictions.
To reduce the loss, we need to adjust these weights in the “right direction.”

The gradient of the loss function w.r.t. the weights, $\nabla L(w)$, tells us:

Direction: The direction in which the loss increases fastest.
Magnitude: How steeply the loss increases along each weight.

So if we followed the gradient as-is, we would increase the loss, which is the opposite of what we want. Subtracting the scaled gradient, as in option B, moves the weights toward lower loss.
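
As a small worked example (the quadratic loss and the numbers are illustrative assumptions, not part of the quiz), take $L(w) = (w - 3)^2$ with $w_{old} = 0$ and $\eta = 0.1$:

$\nabla L(w_{old}) = 2(w_{old} - 3) = -6$

$w_{new} = w_{old} - \eta \times \nabla L(w_{old}) = 0 - 0.1 \times (-6) = 0.6$

$L(w_{new}) = (0.6 - 3)^2 = 5.76 < L(w_{old}) = 9$

With the plus sign from option A, the same step would give $w_{new} = -0.6$ and $L(w_{new}) = 12.96$, so the loss would go up instead of down.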

6. Vanishing Gradient Problem

Which activation function is most likely to cause the vanishing gradient problem?

A. ReLU
B. Leaky ReLU
C. Sigmoid
D. ELU

Answer: C

Explanation: Sigmoid saturates for large inputs, causing gradients to approach zero and slowing learning.

What is the vanishing gradient problem?

When training deep neural networks using gradient-based optimization, the model updates its weights using gradients calculated via backpropagation. In some cases, the gradient becomes extremely small (approaching zero) as it propagates backward through the layers. Due to this, the weights in the earlier layers hardly update and the learning slows dramatically or stops. This is called the vanishing gradient problem.

It often happens with activation functions that “saturate”, such as sigmoid and tanh: functions whose output flattens for large positive or negative inputs, so their derivative there is close to zero.
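
A rough numerical illustration of why saturation matters: the sigmoid derivative never exceeds 0.25, so multiplying it across many layers shrinks the gradient quickly. The sketch below makes a simplifying assumption that every layer contributes the same local derivative and that all weights equal 1; `chained_gradient` is a hypothetical helper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_grad(z):
    return 1.0 if z > 0 else 0.0


def chained_gradient(local_grad, z, num_layers):
    """Product of identical local derivatives over num_layers layers.

    Simplification: weights are assumed to be 1 and every layer sees the same z.
    """
    return local_grad(z) ** num_layers


peak = sigmoid_grad(0.0)            # largest value the sigmoid derivative takes
saturated = sigmoid_grad(10.0)      # derivative in the flat region
deep_sigmoid = chained_gradient(sigmoid_grad, 0.0, 10)
deep_relu = chained_gradient(relu_grad, 1.0, 10)
print("sigmoid derivative at 0:", peak)
print("sigmoid derivative at 10:", saturated)
print("10 sigmoid layers, best case:", deep_sigmoid)
print("10 ReLU layers, active units:", deep_relu)
```

Even in the best case for sigmoid the chained factor drops below one millionth after ten layers, while active ReLU units pass the gradient through unchanged.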

7. Convergence in Training

Which of the following best indicates training convergence?

A. The validation loss starts increasing
B. The training loss becomes zero
C. The change in loss across epochs becomes negligible
D. The learning rate decreases automatically

Answer: C

Explanation: Convergence occurs when further training no longer significantly changes the loss.

What is training convergence?

Training convergence refers to the point during the training of a machine learning model where:

The loss function stops decreasing significantly.
The model parameters (weights) stabilize.
Further training does not improve performance on the training data (and ideally on validation data).

In simple words: the model has “learned as much as it can” from the data.
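
One simple way to turn "the change in loss becomes negligible" into a check is sketched below. The tolerance `tol`, the `patience` window, and the function name `has_converged` are assumptions for illustration; real training loops often use validation loss and framework-specific callbacks instead.

```python
def has_converged(loss_history, tol=1e-4, patience=3):
    """Return True if the last `patience` epoch-to-epoch changes are all below tol."""
    if len(loss_history) < patience + 1:
        return False  # not enough epochs to judge
    recent = loss_history[-(patience + 1):]
    return all(abs(recent[i + 1] - recent[i]) < tol for i in range(patience))


# Hypothetical per-epoch training losses.
still_improving = [1.0, 0.6, 0.4, 0.3, 0.25]
plateaued = [1.0, 0.5, 0.30001, 0.30000, 0.29999, 0.29998]

print("still improving:", has_converged(still_improving))
print("plateaued:", has_converged(plateaued))
```

Note that the plateaued loss is not zero, which is why option B is not the right indicator of convergence.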

8. Optimizer Momentum

What is the role of momentum in optimization algorithms like SGD with momentum?

A. To adapt the learning rate per parameter
B. To average losses across epochs
C. To accelerate convergence by smoothing gradient updates
D. To prevent overfitting

Answer: C

Explanation: Momentum accumulates past gradients to keep moving in consistent directions, improving speed and stability.

What is momentum in an optimization algorithm?

Momentum is a technique used in gradient-based optimization (like stochastic gradient descent) to accelerate training and improve convergence, especially in deep neural networks. It helps the optimizer move faster in the right direction and smooth out oscillations. Think of it as adding “inertia” to the weight updates.

Why use momentum in an optimization algorithm?

During training, gradient descent can run into two problems: oscillations in narrow valleys, where gradients zig-zag across the valley and slow convergence, and slow progress in shallow regions, where small gradients produce tiny updates and therefore slow learning. Momentum addresses both by accumulating past gradients and using them to influence the current update.
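
The "slow progress in shallow regions" case can be sketched with a very flat quadratic loss. This uses one common formulation of momentum (a velocity that accumulates gradients, $v \leftarrow \beta v + \nabla L(w)$, followed by $w \leftarrow w - \eta v$); the loss, the coefficient `k`, and the hyperparameters are illustrative assumptions.

```python
def grad(w, k=0.01):
    # Gradient of the shallow quadratic loss L(w) = 0.5 * k * w**2.
    return k * w


def sgd(w, eta, steps):
    """Plain gradient descent."""
    for _ in range(steps):
        w -= eta * grad(w)
    return w


def sgd_momentum(w, eta, beta, steps):
    """Gradient descent with a velocity term that accumulates past gradients."""
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad(w)   # blend the previous velocity with the new gradient
        w -= eta * v
    return w


w_plain = sgd(1.0, eta=0.1, steps=100)
w_momentum = sgd_momentum(1.0, eta=0.1, beta=0.9, steps=100)
print("plain GD after 100 steps:", w_plain)
print("momentum after 100 steps:", w_momentum)
```

With the same learning rate and number of steps, the momentum run ends noticeably closer to the minimum at $w = 0$, because consistent gradients build up velocity.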

9. Learning Rate Scheduler

Why might we use a learning rate scheduler during training?

A. To gradually reduce the learning rate to fine-tune convergence
B. To reduce overfitting by randomizing learning rates
C. To restart training from previous checkpoints
D. To ensure constant learning rate

Answer: A

Explanation: Decaying the learning rate allows large early steps and fine adjustments later for stable convergence.

What is a learning rate scheduler and why is it needed?

A learning rate scheduler is a strategy for changing the learning rate dynamically during training rather than keeping it constant. Typically, the learning rate starts large, which allows faster learning early on, and then gradually decreases, which allows smaller, more precise steps that fine-tune convergence near a minimum.

The main reasons for using a learning rate scheduler are faster initial learning, more stable convergence, and better final performance.
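
A minimal sketch of one such schedule, step decay, is shown below. The function name `step_decay` and the values for `step_size` and `gamma` are illustrative assumptions; other schedules (exponential, cosine, warm-up) follow the same idea of mapping an epoch number to a learning rate.

```python
def step_decay(initial_lr, epoch, step_size=10, gamma=0.5):
    """Multiply the learning rate by gamma once every step_size epochs."""
    return initial_lr * gamma ** (epoch // step_size)


# Learning rate for each of 30 epochs, starting from 0.1.
schedule = [step_decay(0.1, epoch) for epoch in range(30)]
print("epoch 0:", schedule[0])
print("epoch 10:", schedule[10])
print("epoch 20:", schedule[20])
```

The rate stays constant within each block of epochs and halves at every boundary, giving large steps early and finer steps later.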

10. Batch Normalization Effect

How does batch normalization help during training?

A. By eliminating the need for bias terms
B. By increasing model capacity
C. By forcing all activations to zero
D. By reducing vanishing/exploding gradients and speeding up convergence

Answer: D

Explanation: Batch normalization standardizes layer inputs, stabilizing gradient flow and allowing faster, more reliable training.
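
The "standardizes layer inputs" step can be sketched for a single feature across one mini-batch. This is a simplified training-time forward pass only: `gamma` and `beta` are learnable in practice but fixed here, running statistics for inference are omitted, and the activation values are illustrative.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n   # population variance of the batch
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


# Hypothetical activations of one unit for a mini-batch of four examples.
activations = [10.0, 12.0, 14.0, 16.0]
normalized = batch_norm(activations)

out_mean = sum(normalized) / len(normalized)
out_var = sum((v - out_mean) ** 2 for v in normalized) / len(normalized)
print("normalized:", normalized)
print("mean:", out_mean, "variance:", out_var)
```

After normalization the values have roughly zero mean and unit variance regardless of the original scale, which keeps the inputs to the next layer in a range where gradients are better behaved.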

Monday, October 27, 2025