
Monday, October 27, 2025

Machine Learning Training Phase MCQs with Answers [2025 Updated]

Top 10 MCQs on Training of Machine Learning Models with Answers | Gradient Descent & Optimization Explained

 


 

1. Loss Function Purpose

In supervised training, what is the primary role of the loss function?

A. To measure model speed
B. To measure how far predictions deviate from true labels
C. To determine the optimal learning rate
D. To normalize feature values

Answer: B
 

Explanation: The loss function quantifies prediction error, guiding weight adjustments during training. It is the core compass of the training process: without it, the model has no measure of how well it is performing and no direction in which to improve.

Why the loss function is crucial:

  • Gives feedback to the model
  • Shapes the optimization landscape
  • Controls bias/variance tradeoff 

 

2. Gradient Calculation

In gradient-based optimization, the gradient of the loss function represents:

A. The direction of the steepest descent
B. The direction of the steepest ascent
C. The curvature of the loss surface
D. The absolute value of the error

Answer: B
 

Explanation: The gradient points toward the steepest increase in loss; we move in the opposite direction to minimize it.

What does the gradient tell us?

When we train a model using gradient-based optimization (like gradient descent), we want to minimize the loss function — that is, make the model’s error as small as possible.

To do that, we need to know how the loss changes with respect to the model’s parameters (weights).

That’s exactly what the gradient tells us.

Why do we move in the opposite direction of the gradient?

The gradient itself points toward the direction of maximum increase in the function (loss). But in gradient descent, we want to minimize the loss — so we move in the opposite direction of the gradient.

That’s why the update rule in gradient descent is:

w_{new} = w_{old} - \eta \times \nabla L(w)
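For intuition, here is a minimal Python sketch of this update rule on a toy quadratic loss L(w) = (w − 3)²; the loss, learning rate, and step count are illustrative assumptions, not from the post:

```python
# A minimal sketch of gradient descent on a toy quadratic loss L(w) = (w - 3)^2;
# the loss, learning rate eta, and step count are illustrative.
def grad(w):
    return 2 * (w - 3.0)        # dL/dw, points toward increasing loss

w, eta = 0.0, 0.1               # initial weight and learning rate
for _ in range(100):
    w = w - eta * grad(w)       # move against the gradient to reduce the loss
print(round(w, 4))              # approaches the minimizer w = 3
```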

 

3. Backpropagation Core Idea

What is the main purpose of backpropagation in neural network training?

A. To store intermediate outputs
B. To propagate input forward
C. To compute gradients of weights using the chain rule
D. To normalize activations

Answer: C
 

Explanation: Backpropagation efficiently calculates partial derivatives of the loss with respect to each weight via the chain rule.

Backpropagation (Backward Propagation of Errors) is the algorithm used to train neural networks by adjusting their weights based on the error (loss) between predicted and true outputs.

It’s how the network learns from its mistakes.
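As a rough illustration, the sketch below runs one forward and one backward pass through a tiny one-hidden-layer network with NumPy; the shapes, sigmoid activation, squared-error loss, and toy data are illustrative assumptions:

```python
import numpy as np

# Minimal sketch of backpropagation through a 1-hidden-layer network
# (sigmoid activation, squared-error loss); data and shapes are illustrative.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = np.array([[0.5, 1.0]]), np.array([[1.0]])   # one toy sample
W1, W2 = np.random.randn(2, 3), np.random.randn(3, 1)

# Forward pass
h = sigmoid(x @ W1)
y_hat = h @ W2
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass: the chain rule gives dLoss/dW2 and dLoss/dW1
d_yhat = y_hat - y                     # dLoss/dy_hat
dW2 = h.T @ d_yhat                     # gradient for the output weights
d_h = d_yhat @ W2.T * h * (1 - h)      # propagate error back through the sigmoid
dW1 = x.T @ d_h                        # gradient for the hidden weights
```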

 


 

4. Mini-Batch Training Advantage

Why is mini-batch gradient descent often preferred over batch or stochastic gradient descent?

A. It eliminates gradient noise completely
B. It balances computational efficiency with gradient stability
C. It always converges faster than batch descent
D. It uses no randomness

Answer: B
 

Explanation: Mini-batches provide more stable updates than stochastic GD and require less computation than full-batch GD.

What is mini-batch gradient descent?

Mini-batch gradient descent is a variant of gradient descent where the training dataset is divided into small batches (subsets) of data. The model updates its weights after processing each mini-batch, rather than after every single example or after the entire dataset. 

Mini-batch gradient descent is chosen over SGD or batch gradient descent because it combines faster training, more stable convergence, memory efficiency, and good GPU utilization.
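A minimal NumPy sketch of mini-batch updates for a linear model follows; the toy data, batch size, and learning rate are illustrative assumptions:

```python
import numpy as np

# Sketch of mini-batch gradient descent for linear regression (MSE loss);
# the toy data, batch size, and learning rate are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

w, eta, batch_size = np.zeros(3), 0.05, 32
for epoch in range(20):
    idx = rng.permutation(len(X))                      # shuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]              # indices of one mini-batch
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b) # gradient on this batch only
        w -= eta * grad                                # update after every mini-batch
print(w)                                               # close to the true coefficients
```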


 

5. Weight Update Rule

In standard gradient descent, how are model weights updated?

A. w_{new} = w_{old} + \eta \times \nabla L(w)
B. w_{new} = w_{old} - \eta \times \nabla L(w)
C. w_{new} = w_{old} \times \nabla L(w)
D. w_{new} = \eta \times w_{old}

Answer: B
 

Explanation: We subtract the gradient scaled by the learning rate to move toward lower loss.

When training a model, the goal is to minimize the loss function L(w), which measures how far the model’s predictions are from the true outputs.

  • The weights w of the model determine its predictions.

  • To reduce the loss, we need to adjust these weights in the “right direction.”

The gradient of the loss function w.r.t. the weights, \nabla L(w), tells us:

  • Direction: The direction in which the loss increases fastest.

  • Magnitude: How steeply the loss increases along each weight.

So if we follow the gradient as-is, we’d increase the loss — which is the opposite of what we want.

 


 

6. Vanishing Gradient Problem

Which activation function is most likely to cause the vanishing gradient problem?

A. ReLU
B. Leaky ReLU
C. Sigmoid
D. ELU

Answer: C
 

Explanation: Sigmoid saturates for large inputs, causing gradients to approach zero and slowing learning.

What is vanishing gradient problem?

When training deep neural networks using gradient-based optimization, the model updates its weights using gradients calculated via backpropagation. In some cases, the gradient becomes extremely small (approaching zero) as it propagates backward through the layers. Due to this, the weights in the earlier layers hardly update and the learning slows dramatically or stops. This is called the vanishing gradient problem.

It often happens with activation functions that “saturate” — i.e., functions whose output flattens for large positive or negative inputs. 
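A quick numerical check of this saturation effect: the sigmoid derivative σ′(z) = σ(z)(1 − σ(z)) shrinks toward zero as |z| grows (the input values below are illustrative):

```python
import numpy as np

# Illustrative check of sigmoid saturation: the derivative
# sigma'(z) = sigma(z) * (1 - sigma(z)) shrinks toward 0 for large |z|.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(z)
    print(z, round(s * (1 - s), 6))   # 0.25, ~0.105, ~0.0066, ~0.000045
```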


 

7. Convergence in Training

Which of the following best indicates training convergence?

A. The validation loss starts increasing
B. The training loss becomes zero
C. The change in loss across epochs becomes negligible
D. The learning rate decreases automatically

Answer: C
 

Explanation: Convergence occurs when further training no longer significantly changes the loss.

Training convergence?

Training convergence refers to the point during the training of a machine learning model where:

  • The loss function stops decreasing significantly.

  • The model parameters (weights) stabilize.

  • Further training does not improve performance on the training data (and ideally on validation data).

In simple words: the model has “learned as much as it can” from the data. 
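A simple way to operationalize this is to stop when the epoch-to-epoch change in loss falls below a tolerance; the sketch below uses an illustrative quadratic loss and tolerance:

```python
# Sketch of a convergence check: stop when the change in loss between
# epochs drops below a tolerance. The quadratic "loss" and tolerance are illustrative.
def loss(w):
    return (w - 3.0) ** 2

w, eta, tol = 0.0, 0.1, 1e-8
prev = loss(w)
for epoch in range(10_000):
    w -= eta * 2 * (w - 3.0)          # one gradient step
    cur = loss(w)
    if abs(prev - cur) < tol:         # negligible change across epochs
        print(f"converged at epoch {epoch}, w = {w:.4f}")
        break
    prev = cur
```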


 

8. Optimizer Momentum

What is the role of momentum in optimization algorithms like SGD with momentum?

A. To adapt the learning rate per parameter
B. To average losses across epochs
C. To accelerate convergence by smoothing gradient updates
D. To prevent overfitting

Answer: C
 

Explanation: Momentum accumulates past gradients to keep moving in consistent directions, improving speed and stability.

What is momentum in optimization algorithm?

Momentum is a technique used in gradient-based optimization (like stochastic gradient descent) to accelerate training and improve convergence, especially in deep neural networks. It helps the optimizer move faster in the right direction and smooth out oscillations. Think of it as adding “inertia” to the weight updates. 

Why momentum in optimization algorithm?

During training, gradient descent can face problems such as oscillations in narrow valleys (gradients point in zig-zag directions, slowing convergence) and slow progress in shallow regions (gradients are small, so updates are tiny and learning is slow). Momentum addresses both by accumulating past gradients and using them to influence the current update.
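A minimal sketch of the momentum update on a toy quadratic loss is shown below; the learning rate and momentum coefficient are illustrative assumptions:

```python
# Sketch of SGD with momentum on the quadratic loss (w - 3)^2;
# the learning rate eta and momentum coefficient beta are illustrative.
def grad(w):
    return 2 * (w - 3.0)

w, v, eta, beta = 0.0, 0.0, 0.05, 0.9
for _ in range(100):
    v = beta * v + grad(w)            # velocity: accumulated past gradients
    w = w - eta * v                    # update along the smoothed direction
print(round(w, 4))                     # approaches the minimizer w = 3
```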


 

9. Learning Rate Scheduler

Why might we use a learning rate scheduler during training?

A. To gradually reduce learning rate to fine-tune convergence
B. To reduce overfitting by randomizing learning rates
C. To restart training from previous checkpoints
D. To ensure constant learning rate

Answer: A
 

Explanation: Decaying the learning rate allows large early steps and fine adjustments later for stable convergence.

What is learning rate scheduler and why is needed?

A learning rate scheduler is a strategy to change the learning rate dynamically during training rather than keeping it constant. Typically, the learning rate starts larger at the beginning (allowing faster learning) and then gradually decreases (allowing smaller, more precise steps to fine-tune convergence near a minimum).

Faster initial learning, Stable convergence, and Better final performance are the reasons for using a learning rate scheduler. 
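As a concrete illustration, here is a simple step-decay schedule; the initial rate, drop interval, and decay factor are illustrative assumptions:

```python
# Sketch of a step-decay learning rate schedule: the rate is halved
# every `drop_every` epochs (all values are illustrative).
def step_decay(initial_lr, epoch, drop_every=10, factor=0.5):
    return initial_lr * (factor ** (epoch // drop_every))

for epoch in [0, 9, 10, 25]:
    print(epoch, round(step_decay(0.1, epoch), 4))   # 0.1, 0.1, 0.05, 0.025
```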


 

10. Batch Normalization Effect

How does batch normalization help during training?

A. By eliminating the need for bias terms
B. By increasing model capacity
C. By forcing all activations to zero
D. By reducing vanishing/exploding gradients and speeding up convergence

Answer: D
 

Explanation: Batch normalization standardizes layer inputs, stabilizing gradient flow and allowing faster, more reliable training.
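For intuition, a minimal NumPy sketch of the batch-normalization transform is given below; gamma and beta play the usual roles of learnable scale and shift, eps is a small stabilizing constant, and the toy activations are illustrative:

```python
import numpy as np

# Sketch of the batch-normalization transform for one layer's activations.
def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mean = x.mean(axis=0)                      # per-feature mean over the batch
    var = x.var(axis=0)                        # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)    # standardize the activations
    return gamma * x_hat + beta                # rescale and shift

activations = np.array([[1.0, 50.0], [2.0, 60.0], [3.0, 70.0]])
print(batch_norm(activations))                 # roughly zero-mean, unit-variance columns
```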



 

 

 

 

Saturday, October 18, 2025

10 Hot Decision Tree MCQs: Gain Ratio, Continuous Attributes & Tie-Breaking




1. The root node in a decision tree is selected based on:

A) Minimum entropy
B) Maximum information gain
C) Minimum Gini
D) Random initialization

Answer: B

Explanation: The root node is the first split in the tree. The goal is to reduce uncertainty in the dataset as much as possible. Decision tree algorithms (like ID3 and C4.5) calculate information gain for all attributes, and the attribute with the highest information gain is chosen as the root because it splits the data best, creating the purest child nodes.
The root node is selected by picking the attribute that gives the largest reduction in entropy, i.e., the highest information gain.
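A small NumPy sketch of how entropy and information gain would be computed for a candidate split is given below; the toy labels and the split itself are illustrative:

```python
import numpy as np

# Sketch: entropy and information gain for one candidate split (toy labels).
def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array(["Yes"] * 9 + ["No"] * 5)              # 9 Yes / 5 No
children = [parent[:8], parent[8:]]                       # one illustrative split
weighted = sum(len(c) / len(parent) * entropy(c) for c in children)
info_gain = entropy(parent) - weighted                    # reduction in uncertainty
print(round(info_gain, 3))
```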



2. If a dataset has 100% identical attribute values for all samples but mixed labels, the information gain of any attribute will be:

A) 0
B) 1
C) Undefined
D) Negative

Answer: A

Explanation: If all samples have the same attribute values, splitting on any attribute does not reduce uncertainty. Child nodes after the split are exactly the same as the parent in terms of class distribution. Therefore, the weighted entropy of children = entropy of parent. So, the information gain = 0.



3. In a two-class problem, Gini Index = 0.5 represents:

A) Maximum impurity
B) Pure split
C) Perfect classification
D) Minimum impurity

Answer: A

Explanation: Gini = 0 → the node is pure (all samples belong to one class). Gini = 0.5 → the node is maximally impure in a two-class problem (a 50%-50% split). A Gini Index of 0.5 therefore means the node is completely mixed, with an equal number of samples from both classes.



4. A pruned decision tree generally has:

A) Higher accuracy on training data but lower on test data
B) Lower training accuracy but better generalization
C) Equal accuracy everywhere
D) Random performance

Answer: B

Explanation: Pruning sacrifices some training accuracy to avoid overfitting. Pruning simplifies the tree: it performs slightly worse on the training data but much better on new/unseen data.

Option A: No, this describes an overfitted tree, not a pruned one.
Option C: No, equal accuracy everywhere is rare in practice.
Option D: No, pruning is systematic, not random.



5. In manual decision tree construction, if an attribute gives 0 information gain, what should you do?

A) Still choose it
B) Split on it partially
C) Skip it for splitting 
D) Replace missing values

Answer: C

Explanation: If an attribute gives 0 information gain, it cannot help separate classes, so you ignore it and choose a better attribute for splitting.



6. In a decision tree, if a node contains only one sample, what is its entropy?

A) 0
B) 0.5
C) 1
D) Cannot be calculated

Answer: A

Explanation: A single sample belongs to a single class → node is perfectly pure → entropy = 0.



7. Which splitting criterion can be used for multi-class problems besides binary classification?

A) Gini Index
B) Entropy / Information Gain
C) Gain Ratio
D) All of the above

Answer: D

Explanation: All these measures can handle more than two classes; they just compute probabilities for each class.



8. Which of the following is most likely to cause overfitting in a decision tree?

A) Shallow tree
B) Large minimum samples per leaf
C) Very deep tree with small leaves
D) Using pruning

Answer: C

Explanation: Deep trees with tiny leaves memorize training data → overfit → poor generalization. 



9. In manual construction of a decision tree, what is the first step?

A) Calculate child node entropy
B) Select root attribute based on information gain
C) Split dataset randomly
D) Prune unnecessary branches

Answer: B

Explanation: The root is chosen to maximize information gain, which reduces the initial uncertainty the most.



10. If a node’s children after a split all have entropy = 0.3 and the parent has entropy = 0.3, what does it indicate?

A) Maximum information gain
B) Node is pure
C) Overfitting
D) No information gain

Answer: D

Explanation: Information gain = Parent entropy − Weighted child entropy = 0 → the split did not improve purity.




 

10 Advanced Decision Tree MCQs: Splitting, Overfitting & Pruning Concepts




1. When calculating information gain, the weight of each child node is proportional to:

A) Its entropy value
B) Number of samples in that node
C) Number of attributes used
D) Number of pure classes

Answer: B

Explanation: Each child’s influence on information gain depends on how many samples it contains — bigger child = bigger weight.



2. If a decision tree is very deep and fits perfectly to training data, the issue is:

A) Underfitting
B) Overfitting
C) Bias
D) Data leakage

Answer: B

Explanation: Overfitting happens when a decision tree learns the training data too perfectly, including all the little quirks and noise, instead of learning the general pattern. A very deep tree with many branches is a sign of overfitting.



3. Post-pruning is applied:

A) Before splitting
B) After the full tree is built
C) During initial data cleaning
D) During feature selection

Answer: B

Explanation: Pruning is the process of removing unnecessary branches or nodes from a decision tree to make it simpler and better at generalizing to new data. Post-pruning is done after the full decision tree has been built. Grow first, prune later for better generalization and less overfitting.
We usually remove branches that do not improve accuracy on validation/test data.



4. Which measure prefers attributes with many distinct values, causing possible overfitting?

A) Information Gain
B) Gain Ratio
C) Gini Index
D) Chi-square

Answer: A

Explanation: 

Information Gain (IG): Measures how much an attribute reduces entropy. IG tends to favor attributes with many distinct values (like ID numbers) because they split the data into very small groups, often making each child pure. This can lead to overfitting: the tree memorizes the training data instead of learning patterns.

Gain Ratio: Corrects IG’s bias toward many-valued attributes by normalizing it with the intrinsic (split) information of the split.

Gini Index / Chi-square: Do not have the same strong bias as IG toward many distinct values.



5. In decision tree construction, continuous attributes (like “Age”) are handled by:

A) Ignoring them
B) Creating intervals or thresholds
C) Converting to categorical only
D) Rounding off values

Answer: B

Explanation: Continuous attributes are split at optimal cut-off points to convert them into “less-than / greater-than” branches so the tree can handle them effectively.
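A tiny sketch of how candidate thresholds are typically generated for a continuous attribute: sort the values and take midpoints between consecutive distinct values (the ages below are illustrative):

```python
import numpy as np

# Candidate split thresholds for a continuous attribute like "Age" (toy values).
ages = np.array([22, 25, 25, 31, 40, 52])
distinct = np.unique(ages)
candidates = (distinct[:-1] + distinct[1:]) / 2   # midpoints between distinct values
print(candidates)                                 # [23.5 28.  35.5 46. ]
```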



6. When all attribute values are the same but classes differ, what happens?

A) The tree stops
B) Merge classes
C) Add a new attribute
D) Randomly assign majority class

Answer: D

Explanation: In a decision tree, each split is based on an attribute that can separate the classes. If all attributes have the same value for the remaining samples, no further split is possible. But if those samples have different classes, the node is impure.

What the algorithm does?

Since it cannot split further, it assigns the majority class to the node. This is a practical solution to handle situations where the tree can’t separate the classes anymore. 



7. In C4.5, the gain ratio is used to correct the bias of information gain toward attributes that have:

A) Many missing values
B) Continuous distributions
C) Many distinct values
D) Uniform entropy

Answer: C

Explanation: Information gain favors attributes with many distinct values, because splitting on them often creates very pure child nodes, giving high IG even if the attributes are not meaningful for prediction. This leads to overfitting.
C4.5 uses the gain ratio to correct this bias: Gain Ratio = Information Gain / Split Information. It penalizes attributes with many distinct values, preventing the tree from choosing them just because they split the data into many tiny, pure subsets.
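A minimal sketch of the gain-ratio computation follows; the information gain value and the child sizes are illustrative assumptions, not taken from a specific dataset:

```python
import numpy as np

# Sketch of the gain-ratio correction: information gain divided by the
# split information of the partition (all numbers are illustrative).
def split_information(child_sizes):
    p = np.array(child_sizes, dtype=float) / sum(child_sizes)
    return -np.sum(p * np.log2(p))

info_gain = 0.25                                    # assumed IG of some attribute
gain_ratio = info_gain / split_information([5, 4, 5])
print(round(gain_ratio, 3))
```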



8. The entropy of a node with class probabilities [0.25, 0.75] is approximately:

A) 0.25
B) 0.56
C) 0.81
D) 1.0

Answer: C

Calculation:

-0.25\log_2(0.25) - 0.75\log_2(0.75) \approx 0.81
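The same calculation can be checked quickly in Python:

```python
import math

# Entropy for class probabilities [0.25, 0.75].
h = -0.25 * math.log2(0.25) - 0.75 * math.log2(0.75)
print(round(h, 2))   # 0.81
```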




9. If a split divides a node into child nodes that have the same entropy as the parent node, what is the resulting information gain?

A) Zero
B) Equal to entropy of parent
C) One
D) Half of parent entropy

Answer: A

Explanation: If splitting a node doesn’t reduce entropy at all (child nodes are just as impure as the parent), the information gain = 0, meaning the split doesn’t improve the tree. 



10. Which of these combinations produces maximum information gain?

A) Parent entropy high, child entropies low
B) Parent entropy low, child entropies high
C) Both high
D) Both low

Answer: A

Explanation: The parent entropy represents the initial uncertainty. The weighted sum of child entropies represents the remaining uncertainty after the split. IG is maximized when the parent is very uncertain (high entropy) and the split produces very pure child nodes (low entropy).
Maximum information gain occurs when a very uncertain parent is split into very pure children, because that split removes the most uncertainty.





 

Top 10 Decision Tree MCQs for Manual Construction & Entropy Basics




1. You have a dataset with attributes Weather = {Sunny, Rainy}, Wind = {Weak, Strong}, and the target variable Play = {Yes, No}. If all samples where Weather = Rainy have Play = No, what is the information gain of splitting on “Weather”?

A) 1.0
B) 0.0
C) Depends on other features
D) 0.5

Answer: A

Explanation: Assuming the Sunny samples are all Play = Yes and the two classes are balanced (so the parent entropy is 1), “Weather” perfectly separates the classes into pure subsets. The weighted child entropy is then 0, so the information gain equals the parent entropy of 1.



2. When manually constructing a decision tree, which step comes immediately before calculating information gain for each attribute?

A) Computing class probabilities
B) Normalizing data
C) Calculating entropy of the parent node
D) Pruning the tree

Answer: C

Explanation: Information Gain = Parent Entropy − Weighted Child Entropy. Hence, compute parent entropy first.



3. When two or more attributes have the same information gain, how does the decision tree algorithm choose the next attribute to split on?

A) Choosing the alphabetically first attribute
B) Randomly
C) Using gain ratio or another tie-breaking heuristic
D) Skipping the split

Answer: C

Explanation: Information gain tells us which attribute separates the data best. Sometimes two or more attributes give exactly the same gain, which means they are equally good for splitting, so the algorithm needs a tie-breaker.

Most decision tree algorithms (like C4.5) use an extra measure called the gain ratio or another tie-breaking rule to decide which one to pick. If none of those are used, some implementations may just pick one randomly or based on a fixed order — but the standard approach is to use a heuristic like the gain ratio.



4. You are constructing a decision tree using Gini Index. For a node with class distribution [4 Yes, 1 No], what is the Gini value?

A) 0.16
B) 0.32
C) 0.48
D) 0.64

Answer: B

Explanation: 
Gini = 1 - (p_{Yes}^2 + p_{No}^2) = 1 - (0.8^2 + 0.2^2) = 0.32
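A quick Python check of this value:

```python
# Gini value for class counts [4 Yes, 1 No].
p_yes, p_no = 4 / 5, 1 / 5
gini = 1 - (p_yes ** 2 + p_no ** 2)
print(round(gini, 2))   # 0.32
```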



5. In decision tree learning, entropy = 1 means:

A) The dataset is perfectly pure
B) The dataset is completely impure
C) The tree has overfitted
D) There is no need to split

Answer: B

Explanation: Entropy measures how mixed or impure a group of examples is. It tells us how uncertain we are about the class of a randomly chosen sample.
  • If entropy = 0, the data in that node is pure — all samples belong to the same class (no confusion).

  • If entropy = 1, the data is completely impure — classes are perfectly mixed (maximum confusion).



6. Which attribute will be chosen if one has high entropy but large sample size, and another has low entropy but few samples?

A) The one with higher entropy
B) The one with lower entropy
C) The one giving higher weighted information gain
D) Random choice

Answer: C

Explanation: When a decision tree decides where to split, it uses information gain, not just entropy. It doesn’t just look at how pure (low entropy) each split is. It also considers how many samples go into each child node — that’s the weight part.



7. When manually calculating entropy, what happens if all samples in a node belong to the same class?

A) Entropy = 0
B) Entropy = 1
C) Entropy = 0.5
D) Cannot be determined

Answer: A

Explanation: Entropy measures how mixed the data is — how much uncertainty there is about the class. 
  • If a node has samples of different classes, there is some confusion, so entropy is greater than 0.

  • But if all samples belong to the same class, there is no confusion at all — we’re 100% sure of the class.

When there’s no uncertainty, Entropy = 0.



8. If attribute A reduces entropy by 0.4 and B reduces entropy by 0.2, which one should be chosen?

A) A
B) B
C) Either
D) None

Answer: A

Explanation: When building a decision tree, we pick the attribute that gives the largest reduction in entropy; this reduction is called information gain. The higher the information gain, the better the attribute is at splitting the data and making the nodes purer.



9. Which of the following is not a stopping criterion during manual decision tree construction?

A) All records belong to the same class
B) No remaining attributes
C) Entropy = 1
D) Minimum sample size reached

Answer: C

Explanation: Stopping criteria are conditions that tell us “stop splitting this node”. Common stopping criteria include:
  1. All records belong to the same class → Node is pure → stop splitting 

  2. No remaining attributes → Nothing left to split → stop splitting

  3. Minimum sample size reached → Node is too small to split reliably → stop splitting 



10. Suppose a dataset split yields subsets of size [10, 10] and [5, 0]. Which split is better in terms of information gain?

A) The first
B) The second
C) Both equal
D) Depends on class distribution

Answer: B

Explanation: The second subset is pure → lower entropy → higher information gain.




 

Thursday, October 16, 2025

Top 10 Python Linear Regression MCQs with Answers | Data Science Interview




1. Which Python library is most commonly used for implementing simple and multiple linear regression?

A. NumPy
B. scikit-learn
C. matplotlib
D. pandas

Answer:  B

Explanation: scikit-learn provides LinearRegression() for fitting linear models easily in Python.



2. In scikit-learn, after fitting a LinearRegression model, which attribute gives the coefficients?

A. model.predict()
B. model.score()
C. model.coef_
D. model.intercept_

Answer:  C

Explanation: model.coef_ contains the slope(s) for all independent variables, while model.intercept_ gives the bias term.
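A minimal scikit-learn sketch showing both attributes; the toy data is illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a linear model and read its coefficients and intercept (toy data).
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([8.0, 7.0, 18.0, 17.0])

model = LinearRegression().fit(X, y)
print(model.coef_)        # one slope per feature
print(model.intercept_)   # the bias term
```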



3. What is the purpose of train_test_split() in linear regression implementation?

A. To split features into numerical and categorical
B. To divide the dataset into training and testing sets
C. To normalize the dataset
D. To compute residuals

Answer:  B

Explanation: train_test_split() ensures model evaluation on unseen data, which helps detect overfitting.
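A short usage sketch; the toy data, test_size, and random_state are illustrative choices:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hold-out split: 80% for training, 20% for testing (toy data).
X = np.arange(20).reshape(-1, 1)
y = 2 * X.ravel() + 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)   # (16, 1) (4, 1)
```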


4. Why do we often use StandardScaler() or MinMaxScaler() before applying linear regression?

A. To improve gradient descent convergence
B. Linear regression requires normalized residuals
C. To reduce heteroscedasticity automatically
D. To visualize data better

Answer:  A

Explanation: Scaling features improves numerical stability and speeds up convergence for algorithms like gradient descent.
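A short sketch of the usual pattern: fit the scaler on the training split only, then transform both splits so no statistics from the test set leak into training (the arrays are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Standardize features: fit on the training data, apply to train and test.
X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[1.5, 250.0]])

scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```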



5. In Python, which function calculates R² score for a fitted linear regression model?

A. r2_score(y_true, y_pred)
B. mean_squared_error(y_true, y_pred)
C. np.corrcoef(y, y_pred)
D. score()

Answer:  A

Explanation: sklearn.metrics.r2_score() computes the proportion of variance explained. You can also use model.score(X_test, y_test) in scikit-learn.
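Both routes are shown in this small sketch; the toy data is illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Two equivalent ways to obtain R^2 for a fitted model (toy data).
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.1, 6.9, 9.0])
model = LinearRegression().fit(X, y)

print(r2_score(y, model.predict(X)))   # via sklearn.metrics
print(model.score(X, y))               # same value via the estimator
```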



6. What is the shape of X when using scikit-learn’s LinearRegression for multiple features?

A. (n_samples,)
B. (n_samples, n_features)
C. (n_features, n_samples)
D. (n_features,)

Answer:  B

Explanation: scikit-learn expects 2D input: rows = samples, columns = features. For a single feature, X must be reshaped as (n_samples, 1).
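A small sketch of the reshape step for a single feature; the toy data is illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# A 1-D feature array must be reshaped to (n_samples, 1) before fitting.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

X = x.reshape(-1, 1)                 # shape (4, 1): rows = samples, one feature column
model = LinearRegression().fit(X, y)
```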



7. When implementing linear regression in Python using gradient descent manually, which of the following must be computed in each iteration?

A. Residuals only
B. Partial derivatives of the cost function w.r.t coefficients
C. R² score
D. Train-test split

Answer:  B

Explanation: Gradient descent updates coefficients by computing gradients of the cost function (MSE) with respect to β. It iteratively adjusts each coefficient in the direction that reduces the error, continuing until the model converges to the minimum MSE.
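A compact sketch of manual gradient descent for linear regression with an MSE cost; the toy data, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

# Manual gradient descent for linear regression (MSE cost), toy data.
X = np.c_[np.ones(4), [1.0, 2.0, 3.0, 4.0]]   # bias column + one feature
y = np.array([3.0, 5.0, 7.0, 9.0])            # generated from y = 1 + 2x
beta, eta = np.zeros(2), 0.05

for _ in range(2000):
    grad = 2 * X.T @ (X @ beta - y) / len(y)  # partial derivatives of MSE w.r.t. beta
    beta -= eta * grad                         # move against the gradient
print(beta)                                    # approaches [1, 2]
```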



8. Which of the following commands adds a bias (intercept) term when using NumPy for manual linear regression?

A. X = np.append(X, 1)
B. X = np.ones((n,1))
C. X = np.c_[np.ones((n,1)), X]
D. X = np.concatenate(X)

Answer:  C

Explanation: np.c_ concatenates a column of ones to X, representing the intercept term in the normal equation.



9. If y_pred = model.predict(X_test) in scikit-learn, how do you compute Mean Squared Error?

A. mse = np.mean(y_test - y_pred)
B. mse = np.mean((y_test - y_pred)**2)
C. mse = model.score(X_test, y_test)
D. mse = np.sqrt(np.sum(y_test - y_pred))

Answer:  B

Explanation: MSE is the mean of squared differences between actual and predicted values.
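Both the manual formula and the scikit-learn helper are shown below; the arrays are illustrative:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Two equivalent ways to compute MSE (toy predictions).
y_test = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.8, 5.3, 6.9])

mse_manual = np.mean((y_test - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_test, y_pred)
print(mse_manual, mse_sklearn)
```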



10. Which Python function/method can be used to detect multicollinearity before fitting a linear regression model?

A. model.score()
B. np.corrcoef() or Variance Inflation Factor (VIF)
C. train_test_split()
D. model.predict()

Answer:  B

Explanation: Correlation matrix or VIF helps detect highly correlated independent variables that may destabilize regression coefficients.
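A quick sketch of the correlation-matrix check on a deliberately collinear toy feature matrix; the VIF route via statsmodels' variance_inflation_factor is the other common option:

```python
import numpy as np

# Multicollinearity check via the feature correlation matrix (toy features).
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=100)   # nearly collinear with x1
X = np.column_stack([x1, x2])

corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))    # off-diagonal values near ±1 signal multicollinearity
```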





 
