What is a perceptron?
A Perceptron is the simplest type of artificial neural network and is used for binary classification problems. It works like a decision-making unit that takes multiple inputs, multiplies each input by a weight, adds a bias, and then produces an output.
Mathematically, the perceptron computes a weighted sum of the inputs and passes it through an activation function: y = f(w1x1 + w2x2 + … + wnxn + b), where f is typically a step (threshold) function.
Perceptron Weight Update Using the Perceptron Learning Rule - Answer explained
Given:
- Inputs: x1 = 0, x2 = 0
- Bias input: x3 = +1
- Initial weights: w1 = 1, w2 = 1, w3 = 1
- Learning rate (α) = 1
- Desired (teacher) output: t = 0
- Activation function: Linear Threshold Unit (LTU)
Step 1: Net Input Calculation
net = w1x1 + w2x2 + w3x3
net = (1 × 0) + (1 × 0) + (1 × 1) = 1
Step 2: Actual Output
Since net ≥ 0, the LTU output is:
y = 1
Step 3: Error Calculation
error = t − y = 0 − 1 = −1
Step 4: Weight Update (Perceptron Learning Rule)
wi(new) = wi + α(t − y)xi
Updated weights:
- w1(new) = 1 + (1)(−1)(0) = 1
- w2(new) = 1 + (1)(−1)(0) = 1
- w3(new) = 1 + (1)(−1)(1) = 0
Final Answer
After applying the Perceptron Learning Rule, the updated weights are:
- w1 = 1
- w2 = 1
- w3 = 0
Explanation: Since both input values are zero, the input weights remain unchanged. The perceptron incorrectly produced an output of 1, so the bias weight is reduced to lower the net input in future predictions.
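For reference, here is a minimal NumPy sketch of the same update; the perceptron_step helper and the hard-coded example values are purely illustrative.

```python
import numpy as np

def perceptron_step(weights, x, target, lr=1.0):
    """One update of the perceptron learning rule: wi <- wi + lr * (t - y) * xi."""
    net = np.dot(weights, x)      # weighted sum (bias folded in as x[-1] = 1)
    y = 1 if net >= 0 else 0      # linear threshold unit (LTU)
    return weights + lr * (target - y) * x

# The worked example above: x = (0, 0, +1 bias), w = (1, 1, 1), t = 0
w = np.array([1.0, 1.0, 1.0])
x = np.array([0.0, 0.0, 1.0])
print(perceptron_step(w, x, target=0))   # -> [1. 1. 0.]
```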
Merge Using Single Linkage in Hierarchical Clustering
In Single Linkage hierarchical clustering, the distance between two clusters is defined as the minimum distance between any pair of points, one from each cluster.
Given Clusters
- C1 = {2, 4}
- C2 = {7, 8}
- C3 = {12, 14}
Inter-Cluster Distance Calculations
Distance between C1 and C2:
min{|2 − 7|, |2 − 8|, |4 − 7|, |4 − 8|} = min{5, 6, 3, 4} = 3
Distance between C2 and C3:
min{|7 − 12|, |7 − 14|, |8 − 12|, |8 − 14|} = min{5, 7, 4, 6} = 4
Distance between C1 and C3:
min{|2 − 12|, |2 − 14|, |4 − 12|, |4 − 14|} = min{10, 12, 8, 10} = 8
Conclusion
The smallest inter-cluster distance is d(C1, C2) = 3. Therefore, using Single Linkage, the clusters C1 and C2 are merged in the next iteration.
Resulting cluster: {2, 4, 7, 8}
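The same single-linkage calculation can be scripted in a few lines; the single_linkage helper below is just an illustration for one-dimensional points.

```python
from itertools import product

def single_linkage(c1, c2):
    """Minimum pairwise distance between two clusters of 1-D points."""
    return min(abs(a - b) for a, b in product(c1, c2))

clusters = {"C1": [2, 4], "C2": [7, 8], "C3": [12, 14]}
for a, b in [("C1", "C2"), ("C2", "C3"), ("C1", "C3")]:
    print(a, b, single_linkage(clusters[a], clusters[b]))
# C1-C2: 3, C2-C3: 4, C1-C3: 8  ->  merge C1 and C2
```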
What does the hyperparameter C mean in SVM?
In a soft-margin Support Vector Machine, the hyperparameter C controls the trade-off between:
- Maximizing the margin (simpler model)
- Minimizing classification error on training data
Explanation of each option
Increasing the hyperparameter C penalizes misclassified training points more heavily, forcing the SVM to fit the training data more accurately.
➜ Training error generally decreases.
Hard-margin SVM allows no misclassification and corresponds to C → ∞, not C = 0.
➜ With C = 0, misclassification is not penalized.
Increasing C makes the classifier fit the training data more strictly.
➜ Training error decreases, not increases.
A large C forces the decision boundary to accommodate even outliers.
➜ Sensitivity to outliers increases, not decreases.
Final Answer: Only Option A is true.
Exam Tip: Think of C as the cost of misclassification. High C → low training error but high sensitivity to outliers.
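As a rough illustration (assuming scikit-learn is available, with a small synthetic noisy dataset), the sketch below shows how training error typically falls as C grows, at the cost of greater sensitivity to outliers:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_classification

# Hypothetical noisy dataset for illustration only
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.1, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Larger C penalizes misclassification more heavily, so training error tends
    # to drop, while the boundary becomes more sensitive to individual outliers.
    print(f"C={C}: training error = {1.0 - clf.score(X, y):.3f}")
```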
Kernel SVMs can implicitly operate in infinite-dimensional feature spaces via the kernel trick, while neural networks have finite-dimensional parameterizations.
Option (b):
An SVM can effectively map the data to an infinite-dimensional space; a neural net cannot.
The key idea here comes from the kernel trick. Kernel-based SVMs (such as those using the RBF kernel) implicitly operate in an infinite-dimensional Hilbert space.
- This mapping is done implicitly, without explicitly computing features.
- The number of learned parameters does not grow with the feature space.
- The optimization problem remains convex, guaranteeing a global optimum.
In contrast, neural networks:
- Operate in finite-dimensional parameter spaces (finite neurons and weights).
- Do not truly optimize over an infinite-dimensional feature space.
- Require explicit architectural growth to approximate higher complexity.
SVMs can operate exactly in infinite-dimensional feature spaces via kernels, whereas neural networks can only approximate such mappings using finite architectures.
Why other options are INCORRECT?
- Option (a) — Incorrect: Neural networks can also learn non-linear transformations through hidden layers and activation functions.
- Option (c) — Incorrect: Unlike neural networks, SVMs solve a convex optimization problem and do not get stuck in local minima.
- Option (d) — Incorrect: The implicit feature space created by SVM kernels is typically harder—not easier—to interpret than neural network representations.
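A minimal sketch of the kernel trick: the RBF kernel value equals an inner product in an infinite-dimensional feature space, yet it is computed directly in the input space (the gamma value below is an arbitrary illustrative choice).

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2), which equals <phi(x), phi(z)>
    for an infinite-dimensional feature map phi that is never built explicitly."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.0])
print(rbf_kernel(x, z))   # the kernel value stands in for the infinite-dimensional dot product
```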
What does "training loss is much lower than the validation loss" mean?
A large gap between training and validation loss is a strong indicator of overfitting, where the model has low bias but high variance.
When the training loss is much lower than the validation loss, it means:
- The model is learning the training data too well, including noise and minor patterns.
- It fails to generalize to unseen data (validation set).
- In other words, the network performs well on seen data but poorly on new data.
Why this happens
- The model is too complex (too many layers or neurons).
- Insufficient regularization (e.g., low dropout, weak L2 penalty).
- Limited training data to learn generalized patterns.
- Training for too many epochs, allowing memorization of the training set.
Explanation: Why is option C correct?
A much lower training loss compared to validation loss indicates overfitting. Increasing the L2 regularization weight penalizes large model weights, discourages overly complex decision boundaries, and improves generalization to unseen data.
Why the other options are incorrect
- Option A — Incorrect: Decreasing dropout reduces regularization and typically worsens overfitting.
- Option B — Incorrect: Increasing hidden layer size increases model capacity, making overfitting more likely.
- Option D — Incorrect: Adding more layers increases complexity and usually amplifies overfitting.
Note: When training loss ≪ validation loss, think regularization, simpler models, or more data.
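A minimal sketch of what "increasing the L2 regularization weight" means in terms of the loss function; the function name and the lam parameter are illustrative, not from any particular library:

```python
import numpy as np

def l2_regularized_loss(y_true, y_pred, weights, lam=0.01):
    """Mean squared error plus an L2 penalty on the weights.
    Raising `lam` shrinks large weights and discourages overly complex models."""
    mse = np.mean((y_true - y_pred) ** 2)
    return mse + lam * np.sum(weights ** 2)

# Toy usage: a larger `lam` increases the penalty on the same weights
w = np.array([0.5, -2.0, 3.0])
print(l2_regularized_loss(np.array([1.0]), np.array([0.8]), w, lam=0.01))
print(l2_regularized_loss(np.array([1.0]), np.array([0.8]), w, lam=0.1))
```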
How Decision Trees Handle Real-Valued Attributes
Traditional Approach (Binary Split)
For a real-valued attribute A, decision trees choose a threshold t and split the data as:
- A ≤ t
- A > t
This approach groups nearby values together, allowing the model to learn general patterns while keeping the decision tree simple and robust.
Pat’s Suggestion
Pat proposes using a multiway split, with one branch for each distinct value of the real-valued attribute.
If the attribute has many unique values (which is very common for real-valued data), this would create many branches—potentially one branch per training example.
What Goes Wrong?
1. Perfect Memorization of Training Data
- Each training example can end up in its own branch
- Leaf nodes become extremely “pure”
- The decision tree effectively memorizes the training set
👉 This usually results in very high (sometimes perfect) training accuracy.
2. Very Poor Generalization
- Test data often contains values not seen during training
- Even very close numeric values are treated as completely different
- The model cannot generalize across ranges of values
👉 This leads to poor performance on the test set.
Why Option (iii) Is the Biggest Problem
- Option (i) Too computationally expensive ❌: Multiway splits increase complexity, but learning is still feasible and this is not the main issue.
- Option (ii) Bad on both training and test ❌: Incorrect, because training performance is usually very good.
- Option (iii) Good on training, bad on test ✅ (Correct): This is a classic case of overfitting, where the model learns noise and exact values instead of true patterns.
- Option (iv) Good on test, bad on training ❌: Highly unlikely for a decision tree with this much flexibility.
Final Conclusion
Pat’s approach causes severe overfitting:
- Excellent training accuracy
- Poor generalization to unseen data
Therefore, the correct answer is:
(iii) It would probably result in a decision tree that scores well on the training set but badly on a test set.
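For contrast with Pat's proposal, here is a rough sketch of how the standard binary split handles a real-valued attribute; the candidate_thresholds helper and the sample values are hypothetical:

```python
import numpy as np

def candidate_thresholds(values):
    """Midpoints between consecutive sorted unique values: the usual
    candidates t for a binary split (A <= t vs A > t)."""
    v = np.unique(values)
    return (v[:-1] + v[1:]) / 2.0

ages = np.array([22.0, 25.0, 31.0, 40.0, 41.0])
print(candidate_thresholds(ages))   # a few thresholds to evaluate for the best split
print(len(np.unique(ages)))         # Pat's multiway split: one branch per distinct value
```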
The question is about model capacity / complexity, which directly controls the bias–variance trade-off.
Key Concept: Bias–Variance Trade-off
High Bias (Underfitting)
- Model is too simple
- Cannot capture underlying patterns
High Variance (Overfitting)
- Model is too complex
- Fits noise in the training data
The main factor controlling this trade-off is how expressive the model is.
Evaluating Each Option
Option (i) The Number of Hidden Nodes ✅ (Correct)
- It determines how many parameters the network has
- It controls how complex a function the network can represent
- Few hidden nodes → simple model → high bias (underfitting)
- Many hidden nodes → complex model → high variance (overfitting)
Overall, this directly controls the bias–variance trade-off.
Option (ii) The Learning Rate ❌
Affects training speed and convergence stability but does not change the model’s capacity or bias–variance behavior.
Option (iii) The Initial Choice of Weights ❌
Influences which local minimum is reached, but not the network structure or overall model complexity.
Option (iv) The Use of a Constant-Term Unit Input (Bias Unit) ❌
Allows shifting of activation functions, but has only a minor effect compared to the number of hidden nodes.
The question asks you to select the true statements about k-means clustering, specifically about Lloyd’s Algorithm, which is the standard algorithm used to solve k-means.
k-means is greedy, initialization-dependent, centroid-based, and increasing k never increases the optimal cost.
Answer explanation:
Key assumptions given:
- No two sample points are equal (this avoids tie cases but does not change the main conclusions).
- We are reasoning about the k-means objective (cost) function, usually the sum of squared distances from points to their assigned cluster centroids.
Correct Answer: D
Increasing the number of clusters k can never increase the global minimum of the k-means cost function.
Why this is true:
- When k increases, the algorithm has more freedom to place centroids closer to the data points.
- The optimal cost for k + 1 clusters is never worse than for k clusters. In the worst case, we could reuse the same clustering as before.
- Since the number of data points is greater than the number of clusters, each cluster can contain at least one point, so the objective function remains valid.
Formally:
J*(k + 1) ≤ J*(k)
In the extreme case: k = n (one cluster per point) → each point is its own centroid → cost = 0.
Note: This statement is about the global minimum, not the solution found by Lloyd’s algorithm in practice, which may get stuck in a local minimum.
Why other options are wrong?
- Option A: Lloyd’s finds local minima, not global ❌
- Option B: Average linkage ≠ k-means ❌
- Option C: Initialization affects results ❌
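As an informal check (assuming scikit-learn, with a synthetic dataset), the inertia_ attribute reports the k-means cost of the best solution found; with several restarts (n_init) it approximates the global optimum, and it is non-increasing as k grows, as argued above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical dataset; inertia_ is the k-means cost (sum of squared distances)
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # best cost found: non-increasing in k
```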
“KNN has almost zero training cost because it does not learn a model; it only stores the data.”
Option-by-option explanation
- ❌ Logistic Regression: Training involves iterative optimization (gradient descent, Newton methods). Cost per iteration is O(n·d), and many passes over the data are needed. Not the fastest for very large datasets.
- ❌ Neural Networks: Training requires multiple epochs and backpropagation, which is computationally very expensive. Training time grows rapidly with data size, number of layers, and number of neurons. One of the slowest to train.
- ✅ K-Nearest Neighbors (KNN): KNN has essentially no training phase. Training consists of simply storing the dataset in memory: no optimization, no model fitting. Training time is approximately O(1) (or linear time to store the data). Lowest training time, especially for very large datasets. ⚠️ Note: Prediction time is expensive, but that is not what is asked here (see the timing sketch after this list).
- ❌ Random Forests: Training involves building many decision trees, each performing recursive splits. Training cost grows quickly with the number of trees, tree depth, and dataset size. Slow to train on very large datasets.
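The timing sketch below is illustrative only (scikit-learn, randomly generated hypothetical data); it shows the asymmetry claimed above: fitting KNN is essentially just storing the data (scikit-learn may also build a cheap search index), while prediction carries the real cost.

```python
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical large dataset, purely for illustration
X = np.random.rand(100_000, 20)
y = np.random.randint(0, 2, size=100_000)

t0 = time.perf_counter()
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)   # "training" = storing the data
print("fit time:", time.perf_counter() - t0)

t0 = time.perf_counter()
knn.predict(X[:1000])                                 # the real work happens at query time
print("predict time:", time.perf_counter() - t0)
```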
In linear regression, coefficient magnitude alone does not determine feature importance unless features are on comparable scales.
A high magnitude suggests that the feature is important. However, it may be the case that another feature is highly correlated with this feature and its coefficient also has a high magnitude with the opposite sign, in effect cancelling out the effect of the former. Thus, we cannot really remark on the importance of a feature just because its coefficient has a relatively large magnitude.
Why other options are wrong?
- Option A: The magnitude of a coefficient alone is misleading. Without knowing feature scaling, units, correlation with other features, regularization used etc., you cannot conclude that the feature has a “strong effect”. ❌
- Option B: A high-magnitude coefficient (even negative) indicates that the model is sensitive to that feature. Ignoring the feature based only on the sign or raw magnitude is unjustified. ❌
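One hedged sketch of a common remedy: standardize features before fitting, so that coefficient magnitudes are at least on a comparable scale (synthetic data; the pipeline step name used below is scikit-learn's default):

```python
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_regression

# Hypothetical regression data for illustration only
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

model = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)
# With features on the same scale, coefficient magnitudes become more comparable,
# though strongly correlated features can still yield large, partially cancelling weights.
print(model.named_steps["linearregression"].coef_)
```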
Frequently Asked Questions (Machine Learning)
What does it mean when training loss is much lower than validation loss?
When training loss is much lower than validation loss, it indicates that the model is overfitting. The model has learned the training data very well, including noise, but fails to generalize to unseen data. This usually occurs due to high model complexity or insufficient regularization.
Why does using a multiway split for real-valued attributes in decision trees cause problems?
Using a multiway split for real-valued attributes creates many branches, often one per unique value. This leads to overfitting, where the decision tree performs very well on training data but poorly on test data because the splits capture noise rather than general patterns.
Which machine learning algorithm has the lowest training time for very large datasets?
K-nearest neighbors (KNN) usually has the lowest training time because it does not learn an explicit model. Training simply involves storing the data, while most computation happens during prediction.
Does Lloyd’s algorithm for k-means clustering find the global minimum?
No, Lloyd’s algorithm does not guarantee finding the global minimum of the k-means objective function. It converges to a local minimum that depends on the initial choice of cluster centroids.
Does increasing the number of clusters (k) in k-means always reduce the cost function?
Increasing the number of clusters cannot increase the global minimum of the k-means cost function as long as the number of data points is greater than the number of clusters. The cost is non-increasing as k increases because additional clusters allow equal or better fitting of the data.
How should a large negative coefficient be interpreted in linear regression?
A large negative coefficient indicates that, with other features held fixed, increases in that feature are associated with decreases in the model's prediction. However, the magnitude alone does not determine feature importance unless features are on comparable scales. Additional information such as feature normalization is required.
How does increasing the number of hidden nodes affect bias and variance?
Increasing the number of hidden nodes generally reduces bias but increases variance. While a more complex model can better fit the training data, it also becomes more prone to overfitting.
Why is feature scaling important when interpreting linear model coefficients?
Feature scaling makes coefficients comparable across features in linear models. Without scaling, features measured in smaller units may appear more important due to larger coefficient values, even if their true effect on the target variable is small.