10 Hot Decision Tree MCQs: Gain Ratio, Continuous Attributes & Tie-Breaking
1. The root node in a decision tree is selected based on:
A) Minimum entropy
B) Maximum information gain
C) Minimum Gini
D) Random initialization
Answer:B
Explanation:The root node is the first split in the tree. The goal is to reduce uncertainty in the dataset as much as possible. Decision tree algorithms (like ID3, C4.5) calculate information gain for all attributes. The attribute with the highest information gain is chosen as the root because it splits the data in the best way, creating the purest child nodes.
The root node is selected by picking the attribute that gives the largest reduction in entropy — i.e., the highest information gain.
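To make this concrete, here is a minimal Python sketch (not from any particular library) of the root-selection step, using a hypothetical four-row Weather/Wind/Play table; the entropy and information_gain helpers are illustrative only:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    """Information gain of splitting `rows` (list of dicts) on `attr`."""
    parent = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return parent - remainder

# Hypothetical toy dataset
data = [
    {"Weather": "Sunny", "Wind": "Weak",   "Play": "Yes"},
    {"Weather": "Sunny", "Wind": "Strong", "Play": "Yes"},
    {"Weather": "Rainy", "Wind": "Weak",   "Play": "No"},
    {"Weather": "Rainy", "Wind": "Strong", "Play": "No"},
]

gains = {a: information_gain(data, a, "Play") for a in ("Weather", "Wind")}
root = max(gains, key=gains.get)  # attribute with the highest information gain becomes the root
print(gains, root)                # {'Weather': 1.0, 'Wind': 0.0} 'Weather'
```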
2. If a dataset has 100% identical attribute values for all samples but mixed labels, the information gain of any attribute will be:
A) 0
B) 1
C) Undefined
D) Negative
Answer:A
Explanation:If all samples have the same attribute values, splitting on any attribute does not reduce uncertainty. Child nodes after the split are exactly the same as the parent in terms of class distribution. Therefore, the weighted entropy of children = entropy of parent. So, the information gain = 0.
3. In a two-class problem, Gini Index = 0.5 represents:
A) Maximum impurity
B) Pure split
C) Perfect classification
D) Minimum impurity
Answer:A
Explanation:Gini = 0 → node is pure (all samples belong to one class). Gini = 0.5 → node is maximally impure in a two-class problem (50%-50% split). Gini Index = 0.5 means the node is completely mixed, with an equal number of samples from both classes.
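For illustration, a tiny Python sketch of the two-class Gini computation, showing 0.5 for a perfectly mixed node and 0 for a pure one (the gini helper is just a sketch):

```python
def gini(class_counts):
    """Gini impurity for a node given its class counts."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

print(gini([10, 10]))  # 0.5 -> maximally impure two-class node
print(gini([20, 0]))   # 0.0 -> pure node
```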
4. A pruned decision tree generally has:
A) Higher accuracy on training data but lower on test data
B) Lower training accuracy but better generalization
C) Equal accuracy everywhere
D) Random performance
Answer:B
Explanation:Pruning simplifies the tree and deliberately gives up a little training accuracy to avoid overfitting: the pruned tree is slightly worse on training data but usually much better on new/unseen data.
Option A: NO - this is an overfitted tree, not a pruned one.
Option C: NO - Rare in practice
Option D: NO - Pruning is systematic not random.
5. In manual decision tree construction, if an attribute gives 0 information gain, what should you do?
A) Still choose it
B) Split based on it partially
C) Skip it for splitting
D) Replace missing values
Answer: C
Explanation:If an attribute gives 0 information gain, it cannot help separate classes, so you ignore it and choose a better attribute for splitting.
6. In a decision tree, if a node contains only one sample, what is its entropy?
A) 0
B) 0.5
C) 1
D) Cannot be calculated
Answer:A
Explanation:A single sample belongs to a single class → node is perfectly pure → entropy = 0.
7. Which splitting criterion can be used for multi-class problems besides binary classification?
A) Gini Index
B) Entropy / Information Gain
C) Gain Ratio
D) All of the above
Answer:D
Explanation:All these measures can handle more than two classes; they just compute probabilities for each class.
8. Which of the following is most likely to cause overfitting in a decision tree?
A) Shallow tree
B) Large minimum samples per leaf
C) Very deep tree with small leaves
D) Using pruning
Answer:C
Explanation:Deep trees with tiny leaves memorize training data → overfit → poor generalization.
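A small scikit-learn sketch of this effect, assuming synthetic data from make_classification as a stand-in for a real dataset; the unrestricted tree tends to memorize the training set, while the depth- and leaf-limited tree usually generalizes better:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # synthetic stand-in data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # unrestricted: tends to memorize
shallow = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=0).fit(X_tr, y_tr)

print(deep.score(X_tr, y_tr), deep.score(X_te, y_te))        # near-perfect train score, lower test score
print(shallow.score(X_tr, y_tr), shallow.score(X_te, y_te))  # lower train score, usually better generalization
```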
9. In manual construction of a decision tree, what is the first step?
A) Calculate child node entropy
B) Select root attribute based on information gain
C) Split dataset randomly
D) Prune unnecessary branches
Answer:B
Explanation:The root is chosen to maximize information gain, which reduces the initial uncertainty the most.
10. If a node’s children after a split all have entropy = 0.3 and the parent has entropy = 0.3, what does it indicate?
A) Maximum information gain
B) Node is pure
C) Overfitting
D) No information gain
Answer: D
Explanation:Information gain = Parent entropy − Weighted child entropy = 0 → the split did not improve purity.
10 Advanced Decision Tree MCQs: Splitting, Overfitting & Pruning Concepts
1. When calculating information gain, the weight of each child node is proportional to:
A) Its entropy value
B) Number of samples in that node
C) Number of attributes used
D) Number of pure classes
Answer:B
Explanation:Each child’s influence on information gain depends on how many samples it contains — bigger child = bigger weight.
2. If a decision tree is very deep and fits perfectly to training data, the issue is:
A) Underfitting
B) Overfitting
C) Bias
D) Data leakage
Answer:B
Explanation:Overfitting happens when a decision tree learns the training data too perfectly, including all the little quirks and noise, instead of learning the general pattern. A very deep tree with many branches is a classic sign of overfitting.
3. Post-pruning is applied:
A) Before splitting
B) After the full tree is built
C) During initial data cleaning
D) During feature selection
Answer:B
Explanation:Pruning is the process of removing unnecessary branches or nodes from a decision tree to make it simpler and better at generalizing to new data. Post-pruning is done after the full decision tree has been built. Grow first, prune later for better generalization and less overfitting.
We usually remove branches that do not improve accuracy on validation/test data.
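In scikit-learn, post-pruning is available as minimal cost-complexity pruning via the ccp_alpha parameter. A rough sketch, assuming the built-in breast-cancer dataset as a stand-in and a simple hold-out set for choosing the pruning strength:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Grow the full tree first, then prune: compute the candidate cost-complexity alphas
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
best = max(
    (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_tr, y_tr) for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_val, y_val),  # keep the pruned tree that does best on held-out data
)
print(best.get_n_leaves(), best.score(X_val, y_val))
```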
4. Which measure prefers attributes with many distinct values, causing possible overfitting?
A) Information Gain
B) Gain Ratio
C) Gini Index
D) Chi-square
Answer:A
Explanation:
Information Gain (IG): Measures how much an attribute reduces entropy. IG tends to favor attributes with many distinct values (like ID numbers) because they split the data into very small groups, often making each child pure. This can lead to overfitting — the tree memorizes the training data instead of learning patterns.
Gain Ratio: Corrects IG’s bias toward many-value attributes by normalizing it with the intrinsic information of the split.
Gini Index / Chi-square: Do not have the same strong bias as IG toward many distinct values.
5. In decision tree construction, continuous attributes (like “Age”) are handled by:
A) Ignoring them
B) Creating intervals or thresholds
C) Converting to categorical only
D) Rounding off values
Answer:B
Explanation:Continuous attributes are split at optimal cut-off points to convert them into “less-than / greater-than” branches so the tree can handle them effectively.
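A simplified sketch of how such a cut-off can be found: candidate thresholds are placed midway between consecutive distinct values, and the one with the highest information gain is kept (the helper functions and the toy Age data are hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Pick the 'Age <= t' cut-off that maximizes information gain."""
    parent = entropy(labels)
    best_gain, best_t = 0.0, None
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):  # candidate cuts at midpoints
        t = (lo + hi) / 2
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        gain = parent - (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

ages = [22, 25, 30, 35, 40, 52]
play = ["No", "No", "No", "Yes", "Yes", "Yes"]
print(best_threshold(ages, play))  # (32.5, 1.0) for this toy data
```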
6. When all attribute values are the same but classes differ, what happens?
A) The tree stops
B) Merge classes
C) Add a new attribute
D) Randomly assign majority class
Answer: D
Explanation:In a decision tree, each split is based on an attribute that can separate the classes. If all attributes have the same value for the remaining samples, no further split is possible. But if the samples have different classes, the node is impure.
What does the algorithm do?
Since it cannot split further, it assigns the majority class to the node. This is a practical way of handling samples that the remaining attributes can no longer separate.
7. In C4.5, the gain ratio is used to correct the bias of information gain toward attributes that have:
A) Many missing values
B) Continuous distributions
C) Many distinct values
D) Uniform entropy
Answer: C
Explanation:Information gain favors attributes with many distinct values, because splitting on them often creates very pure child nodes, giving high IG even if the attributes are not meaningful for prediction. This can lead to overfitting.
C4.5 uses gain ratio to correct this bias. Gain ratio = Information Gain / Split Information. It penalizes attributes with many distinct values, preventing the tree from choosing them just because they split the data perfectly.
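A small illustrative sketch of the gain ratio computation, using a hypothetical four-row example where an ID-like attribute and a genuine attribute both split the data perfectly, but only the genuine attribute keeps a high gain ratio:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(attr_values, labels):
    """Gain ratio = information gain / split information (C4.5-style)."""
    n = len(labels)
    parent = entropy(labels)
    remainder, split_info = 0.0, 0.0
    for value, count in Counter(attr_values).items():
        subset = [l for a, l in zip(attr_values, labels) if a == value]
        remainder += count / n * entropy(subset)
        split_info -= count / n * math.log2(count / n)
    gain = parent - remainder
    return gain / split_info if split_info > 0 else 0.0

ids = ["a", "b", "c", "d"]                        # ID-like attribute: every value is distinct
weather = ["Sunny", "Sunny", "Rainy", "Rainy"]    # genuine attribute
play = ["No", "No", "Yes", "Yes"]

print(gain_ratio(ids, play))      # 0.5 -> penalized despite an information gain of 1
print(gain_ratio(weather, play))  # 1.0 -> the meaningful attribute wins
```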
8. The entropy of a node with class probabilities [0.25, 0.75] is approximately:
A) 0.25
B) 0.56
C) 0.81
D) 1.0
Answer:C
Calculation: Entropy = −(0.25 × log₂0.25) − (0.75 × log₂0.75) = 0.25 × 2 + 0.75 × 0.415 = 0.5 + 0.311 ≈ 0.81.
9. If a split divides a node into child nodes that have the same entropy as the parent node, what is the resulting information gain?
A) Zero
B) Equal to entropy of parent
C) One
D) Half of parent entropy
Answer:A
Explanation:If splitting a node doesn’t reduce entropy at all (child nodes are just as impure as the parent), the information gain = 0, meaning the split doesn’t improve the tree.
10. Which of these combinations produces maximum information gain?
A) Parent entropy high, child entropies low
B) Parent entropy low, child entropies high
C) Both high
D) Both low
Answer:A
Explanation:The parent entropy represents the initial uncertainty. The weighted sum of child entropies represents the remaining uncertainty after the split. IG is maximized when the parent is very uncertain (high entropy) and the split produces very pure child nodes (low entropy).
Maximum information gain occurs when a very uncertain parent is split into very pure children, because the split reduces the most uncertainty.
Top 10 Decision Tree MCQs for Manual Construction & Entropy Basics
1. You have a dataset with attributes Weather = {Sunny, Rainy}, Wind = {Weak, Strong}, and the target variable Play = {Yes, No}. If all samples where Weather = Rainy have Play = No, what is the information gain of splitting on “Weather”?
A) 1.0
B) 0.0
C) Depends on other features
D) 0.5
Answer:A
Explanation:Assuming the rest of the data is the mirror case (all Sunny samples have Play = Yes) and the Yes/No classes are balanced, “Weather” separates the classes into pure subsets. The child entropies are 0, so the information gain equals the parent entropy, which is 1.
2. When manually constructing a decision tree, which step comes immediately before calculating information gain for each attribute?
Answer: Calculating the entropy of the parent node (the current dataset).
Explanation:Information gain = parent entropy − weighted child entropy, so the parent node's entropy must be computed before any attribute's gain can be evaluated.
3. When two or more attributes have the same information gain, how does the decision tree algorithm choose the next attribute to split on?
A) Choosing the alphabetically first attribute
B) Randomly
C) Using gain ratio or another tie-breaking heuristic
D) Skipping the split
Answer:C
Explanation:Information gain tells us which attribute separates the data best. Sometimes, two or more attributes give exactly the same gain — that means they are equally good for splitting. So, the algorithm needs a tie-breaker.
Most decision tree algorithms (like C4.5) use an extra measure called the gain ratio or another tie-breaking rule to decide which one to pick. If none of those are used, some implementations may just pick one randomly or based on a fixed order — but the standard approach is to use a heuristic like the gain ratio.
4. You are constructing a decision tree using Gini Index. For a node with class distribution [4 Yes, 1 No], what is the Gini value?
A) 0.16
B) 0.32
C) 0.48
D) 0.64
Answer:B
Explanation:p(Yes) = 4/5 = 0.8, p(No) = 1/5 = 0.2. Gini = 1 − (0.8² + 0.2²) = 1 − (0.64 + 0.04) = 0.32.
5. In decision tree learning, entropy = 1 means:
A) The dataset is perfectly pure
B) The dataset is completely impure
C) The tree has overfitted
D) There is no need to split
Answer:B
Explanation:Entropy measures how mixed or impure a group of examples is. It tells us how uncertain we are about the class of a randomly chosen sample.
If entropy = 0, the data in that node is pure — all samples belong to the same class (no confusion).
If entropy = 1, the data is completely impure — classes are perfectly mixed (maximum confusion).
6. Which attribute will be chosen if one has high entropy but large sample size, and another has low entropy but few samples?
A) The one with higher entropy
B) The one with lower entropy
C) The one giving higher weighted information gain
D) Random choice
Answer:C
Explanation:When a decision tree decides where to split, it uses information gain, not just entropy. It doesn’t just look at how pure (low entropy) each split is. It also considers how many samples go into each child node — that’s the weight part.
7. When manually calculating entropy, what happens if all samples in a node belong to the same class?
A) Entropy = 0
B) Entropy = 1
C) Entropy = 0.5
D) Cannot be determined
Answer:A
Explanation:Entropy measures how mixed the data is — how much uncertainty there is about the class.
If a node has samples of different classes, there is some confusion— entropy is greater than 0.
But if all samples belong to the same class, there is no confusion at all — we’re 100% sure of the class.
When there’s no uncertainty, Entropy=0
8. If attribute A reduces entropy by 0.4 and B reduces entropy by 0.2, which one should be chosen?
A) A
B) B
C) Either
D) None
Answer:A
Explanation:When building a decision tree, we pick the attribute that gives the largest reduction in entropy — this reduction is called information gain. The higher the information gain, the better the attribute at splitting the data and making the node purer.
9. Which of the following is not a stopping criterion during manual decision tree construction?
A) All records belong to the same class
B) No remaining attributes
C) Entropy = 1
D) Minimum sample size reached
Answer:C
Explanation:Stopping criteria are conditions that tell us “stop splitting this node”. Common stopping criteria include:
All records belong to the same class → Node is pure → stop splitting
No remaining attributes → Nothing left to split → stop splitting
Minimum sample size reached → Node is too small to split reliably → stop splitting
10. Suppose two candidate splits produce child nodes with class distributions [10, 10] and [5, 0]. Which split is better in terms of information gain?
A) The first
B) The second
C) Both equal
D) Depends on class distribution
Answer:B
Explanation:The [5, 0] child is pure (entropy = 0), while [10, 10] is maximally impure (entropy = 1). The split that produces the pure subset removes more uncertainty, so it gives the higher information gain.
Top 10 Python Linear Regression MCQs with Answers | Data Science Interview
1. Which Python library is most commonly used for implementing simple and multiple linear regression?
A. NumPy
B. scikit-learn
C. matplotlib
D. pandas
Answer: B
Explanation:scikit-learn provides LinearRegression() for fitting linear models easily in Python.
2. In scikit-learn, after fitting a LinearRegression model, which attribute gives the coefficients?
A. model.predict()
B. model.score()
C. model.coef_
D. model.intercept_
Answer: C
Explanation:model.coef_ contains the slope(s) for all independent variables, while model.intercept_ gives the bias term.
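A minimal sketch of fitting LinearRegression on hypothetical synthetic data and reading back the coefficients and intercept:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: y = 3*x1 + 2*x2 + 5 plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + 5 + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
print(model.coef_)       # slopes for each feature, roughly [3, 2]
print(model.intercept_)  # bias term, roughly 5
```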
3. What is the purpose of train_test_split() in linear regression implementation?
A. To split features into numerical and categorical
B. To divide the dataset into training and testing sets
C. To normalize the dataset
D. To compute residuals
Answer: B
Explanation:train_test_split() ensures model evaluation on unseen data, which helps detect overfitting.
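A short sketch of the usual workflow, using make_regression as a stand-in dataset: split, fit on the training portion, and compare performance on the two sets:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)  # synthetic data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print(model.score(X_train, y_train))  # R² on training data
print(model.score(X_test, y_test))    # R² on unseen data; a large gap suggests overfitting
```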
4. Why do we often use StandardScaler() or MinMaxScaler() before applying linear regression?
A. To improve gradient descent convergence
B. Linear regression requires normalized residuals
C. To reduce heteroscedasticity automatically
D. To visualize data better
Answer: A
Explanation:Scaling features improves numerical stability and speeds up convergence for algorithms like gradient descent.
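A sketch of scaling in practice; note that plain LinearRegression (closed-form) does not strictly need it, so the example uses SGDRegressor, a gradient-descent-based estimator, inside a pipeline so the same scaling is applied at fit and predict time:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)  # stand-in data

# Scale inside a pipeline so train-time and predict-time preprocessing stay consistent
model = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, random_state=0))
model.fit(X, y)
print(model.score(X, y))
```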
5. In Python, which function calculates R² score for a fitted linear regression model?
A. r2_score(y_true, y_pred)
B. mean_squared_error(y_true, y_pred)
C. np.corrcoef(y, y_pred)
D. score()
Answer: A
Explanation:sklearn.metrics.r2_score() computes the proportion of variance explained. You can also use model.score(X_test, y_test) in scikit-learn.
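A tiny usage sketch with hypothetical values (in a real workflow y_pred would come from model.predict(X_test)):

```python
from sklearn.metrics import r2_score

y_test = [3.0, -0.5, 2.0, 7.0]  # hypothetical actual values
y_pred = [2.5,  0.0, 2.0, 8.0]  # hypothetical predictions
print(r2_score(y_test, y_pred))  # proportion of variance explained
# Equivalent shortcut on a fitted scikit-learn model: model.score(X_test, y_test)
```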
6. What is the shape of X when using scikit-learn’s LinearRegression for multiple features?
A. (n_samples,)
B. (n_samples, n_features)
C. (n_features, n_samples)
D. (n_features,)
Answer: B
Explanation:scikit-learn expects 2D input: rows = samples, columns = features. For a single feature, X must be reshaped as (n_samples, 1).
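A quick sketch of the single-feature case, with hypothetical age/salary arrays:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

age = np.array([22, 25, 30, 35, 40])      # 1D array, shape (n_samples,)
salary = np.array([25, 28, 35, 42, 50])   # hypothetical target values

X = age.reshape(-1, 1)                    # scikit-learn needs shape (n_samples, 1)
model = LinearRegression().fit(X, salary)
print(X.shape, model.coef_)
```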
7. When implementing linear regression in Python using gradient descent manually, which of the following must be computed in each iteration?
A. Residuals only
B. Partial derivatives of the cost function w.r.t coefficients
C. R² score
D. Train-test split
Answer: B
Explanation:Gradient descent updates coefficients by computing gradients of the cost function (MSE) with respect to β. It iteratively adjusts each coefficient in the direction that reduces the error, continuing until the model converges to the minimum MSE.
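A minimal sketch of batch gradient descent for one feature on hypothetical data; grad_w and grad_b are the partial derivatives of the MSE cost with respect to the slope and intercept:

```python
import numpy as np

# Hypothetical 1-feature data: y ≈ 4x + 2
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)
y = 4 * x + 2 + rng.normal(scale=0.1, size=200)

w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    error = w * x + b - y             # residuals for the current coefficients
    grad_w = 2 * np.mean(error * x)   # d(MSE)/dw
    grad_b = 2 * np.mean(error)       # d(MSE)/db
    w -= lr * grad_w                  # step against the gradient
    b -= lr * grad_b

print(w, b)  # roughly 4 and 2
```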
8. Which of the following commands adds a bias (intercept) term when using NumPy for manual linear regression?
A. X = np.append(X, 1)
B. X = np.ones((n,1))
C. X = np.c_[np.ones((n,1)), X]
D. X = np.concatenate(X)
Answer: C
Explanation:np.c_ concatenates a column of ones to X, representing the intercept term in the normal equation.
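A short sketch of the full normal-equation computation on hypothetical data, with np.c_ adding the bias column:

```python
import numpy as np

# Hypothetical data: y = 3*x1 - 2*x2 + 1
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1

n = X.shape[0]
Xb = np.c_[np.ones((n, 1)), X]              # prepend a column of ones for the intercept
beta = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y  # normal equation: β = (XᵀX)⁻¹Xᵀy
print(beta)                                 # approximately [1, 3, -2]
```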
9. If y_pred = model.predict(X_test) in scikit-learn, how do you compute Mean Squared Error?
A. mse = np.mean(y_test - y_pred)
B. mse = np.mean((y_test - y_pred)**2)
C. mse = model.score(X_test, y_test)
D. mse = np.sqrt(np.sum(y_test - y_pred))
Answer: B
Explanation:MSE is the mean of squared differences between actual and predicted values.
10. Which Python function/method can be used to detect multicollinearity before fitting a linear regression model?
A. model.score()
B. np.corrcoef() or Variance Inflation Factor (VIF)
C. train_test_split()
D. model.predict()
Answer: B
Explanation:Correlation matrix or VIF helps detect highly correlated independent variables that may destabilize regression coefficients.
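A sketch of both checks on hypothetical features where x3 is almost a linear combination of x1 and x2; the VIF helper comes from statsmodels:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical features; x3 is nearly a linear combination of x1 and x2
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + x2 + rng.normal(scale=0.01, size=100)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

print(X.corr())                  # pairwise correlations (np.corrcoef gives the same information)
Xc = sm.add_constant(X)          # include an intercept column for the VIF regressions
vifs = {col: variance_inflation_factor(Xc.values, i)
        for i, col in enumerate(Xc.columns) if col != "const"}
print(vifs)                      # VIF values far above 10 flag multicollinearity
```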
Top 10 Technical Linear Regression MCQs with Answers | Data Science Interview
1. In linear regression, what is the primary purpose of the cost function (usually MSE)?
A. To calculate the correlation coefficient
B. To measure how well the model predicts the output
C. To compute residuals only
D. To normalize the input features
Answer: B
Explanation:The cost function, typically Mean Squared Error (MSE), measures the difference between predicted and actual values; optimization algorithms minimize this to fit the best line.
2. Which of the following can cause the linear regression coefficients to become unstable?
A. High variance in Y
B. Large dataset
C. Small residuals
D. Multicollinearity among X variables
Answer: D
Explanation:Multicollinearity occurs when independent variables are highly correlated, making it difficult to isolate their individual effect, leading to unstable coefficients.
3. In multiple linear regression, the design matrix X is of shape (n, p). What does p represent?
A. Number of data points
B. Number of features (including bias term if added)
C. Predicted values
D. Residuals
Answer: B
Explanation:p represents the number of independent variables (features) in the regression model.
4. Gradient descent may fail to converge in linear regression if:
A. The learning rate is too high
B. The data is normalized
C. The residuals are small
D. There are too many features
Answer: A
Explanation:A high learning rate can cause overshooting of the minimum, preventing convergence.
5. Ridge regression differs from standard linear regression in that it:
A. Minimizes absolute residuals instead of squared residuals
B. Adds L2 regularization to penalize large coefficients
C. Works only with categorical variables
D. Eliminates all multicollinearity automatically
Answer: B
Explanation:Ridge regression adds an L2 penalty term to reduce overfitting and control coefficient magnitude.
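A small sketch comparing plain least squares with Ridge on synthetic data; the L2 penalty shrinks the overall size of the coefficient vector:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)  # stand-in data

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha sets the strength of the L2 penalty

# The L2 penalty shrinks the coefficient vector compared with plain least squares
print(np.linalg.norm(ols.coef_), np.linalg.norm(ridge.coef_))
```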
6. What is the effect of heteroscedasticity on linear regression models?
A. It affects the efficiency of coefficient estimates
B. It biases the coefficient estimates
C. It changes the slope direction
D. It increases R² automatically
Answer: A
Explanation:Heteroscedasticity doesn’t bias coefficients but makes standard errors unreliable, affecting confidence intervals and hypothesis testing.
What is Heteroscedasticity?
Heteroscedasticity is a statistical property. It occurs when the variance of the residuals is not constant across all levels of the independent variable(s). In other words, the spread (or "scatter") of the residuals changes as the predicted value or an independent variable changes. The opposite case, where residuals have constant variance, is called homoscedasticity.
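As an illustration, one common way to test for heteroscedasticity in Python is the Breusch-Pagan test from statsmodels; the sketch below builds hypothetical data whose noise grows with x:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Hypothetical data whose error variance grows with x (heteroscedastic)
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=200)
y = 2 * x + rng.normal(scale=x, size=200)  # noise spread increases with x

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, X)
print(lm_pvalue)  # a small p-value indicates non-constant residual variance
```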
7. Which technique can you use to check if a linear regression model is overfitting?
A. Check R² on training data only
B. Evaluate model performance on a separate validation/test set
C. Compute residual sum of squares only
D. Increase learning rate
Answer: B
Explanation:Overfitting is detected by poor performance on unseen data compared to training data.
8. What is the closed-form solution (normal equation) for β in linear regression?
A. β = XᵀXy
B. β = X⁻¹y
C. β = (XᵀX)⁻¹ Xᵀy
D. β = yX
Answer: C
Explanation:The normal equation provides the optimal coefficient vector β = (XᵀX)⁻¹Xᵀy, obtained by setting the gradient of the squared-error cost to zero.
9. In linear regression, adding irrelevant features typically:
A. Reduces bias and increases variance
B. Increases R² but may reduce generalization
C. Decreases all residuals to zero
D. Has no effect on coefficients
Answer: B
Explanation:Irrelevant features may artificially increase R² but often reduce performance on new data (overfitting).
10. The p-value associated with a coefficient in linear regression indicates:
A. The probability that the coefficient is exactly zero
B. The R² of the model
C. The magnitude of the residuals
D. The significance of that feature in explaining Y
Answer: D
Explanation:A low p-value (<0.05) suggests that the corresponding predictor significantly contributes to explaining the variability in Y.
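A sketch using statsmodels OLS (which reports p-values directly) on hypothetical data where x1 truly drives y and x2 is pure noise:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: x1 matters, x2 does not
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 5 * x1 + rng.normal(size=100)

X = sm.add_constant(np.column_stack([x1, x2]))
results = sm.OLS(y, X).fit()
print(results.pvalues)  # expect a tiny p-value for x1 and a large one for x2
```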