1.
Consider a Perceptron that has two input units and one output unit, which uses an LTU activation function, plus a bias input of +1 and a bias weight w3 = 1. If both inputs associated with an example are 0 and both weights, w1 and w2, connecting the input units to the output unit have value 1, and the desired (teacher) output value is 0, how will the weights change after applying the Perceptron Learning rule with learning rate parameter α = 1? [University of Wisconsin–Madison, CS540-2: Introduction to Artificial Intelligence, May 2018 - Final exam answers]







Correct Answer: D

What is a Perceptron?

A Perceptron is the simplest type of artificial neural network and is used for binary classification problems. It works like a decision-making unit that takes multiple inputs, multiplies each input by a weight, adds a bias, and then produces an output.

Mathematically, the Perceptron computes a weighted sum of the inputs and passes it through an activation function:

net = w1x1 + w2x2 + … + wnxn + b
y = 1 if net ≥ 0, otherwise y = 0

Perceptron Weight Update Using the Perceptron Learning Rule - Answer explained

Given:

  • Inputs: x1 = 0, x2 = 0
  • Bias input: x3 = +1
  • Initial weights: w1 = 1, w2 = 1, w3 = 1
  • Learning rate (α) = 1
  • Desired (teacher) output: t = 0
  • Activation function: Linear Threshold Unit (LTU)

Step 1: Net Input Calculation

net = w1x1 + w2x2 + w3x3
net = (1 × 0) + (1 × 0) + (1 × 1) = 1

Step 2: Actual Output

Since net ≥ 0, the LTU output is:
y = 1

Step 3: Error Calculation

error = t − y = 0 − 1 = −1

Step 4: Weight Update (Perceptron Learning Rule)

wi(new) = wi + α(t − y)xi

Updated weights:

  • w1(new) = 1 + (1)(−1)(0) = 1
  • w2(new) = 1 + (1)(−1)(0) = 1
  • w3(new) = 1 + (1)(−1)(1) = 0

Final Answer

After applying the Perceptron Learning Rule, the updated weights are:

  • w1 = 1
  • w2 = 1
  • w3 = 0

Explanation: Since both input values are zero, the input weights remain unchanged. The perceptron incorrectly produced an output of 1, so the bias weight is reduced to lower the net input in future predictions.
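The update steps above can be sketched in Python (a minimal sketch; the function and variable names are illustrative, with the bias treated as a third input fixed at +1):

```python
def perceptron_update(weights, inputs, target, lr=1.0):
    """One step of the Perceptron Learning Rule with an LTU activation."""
    net = sum(w * x for w, x in zip(weights, inputs))  # weighted sum incl. bias
    y = 1 if net >= 0 else 0                           # LTU: threshold at 0
    error = target - y                                 # t - y
    # wi(new) = wi + lr * (t - y) * xi
    return [w + lr * error * x for w, x in zip(weights, inputs)]

# Inputs x1 = 0, x2 = 0, bias input x3 = +1; initial weights all 1; teacher t = 0
new_w = perceptron_update([1, 1, 1], [0, 0, 1], target=0, lr=1.0)
print(new_w)  # [1, 1, 0]
```

Because the update is proportional to xi, weights attached to zero inputs can never change in a single step; only the bias weight moves.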

2.
Consider a dataset containing six one-dimensional points: {2, 4, 7, 8, 12, 14}. After three iterations of Hierarchical Agglomerative Clustering using Euclidean distance between points, we get three clusters: C1 = {2, 4}, C2 = {7, 8}, and C3 = {12, 14}. [University of Wisconsin–Madison, CS540: Introduction to Artificial Intelligence, October 2019 - Midterm exam answers]







Correct Answer: A

Merge Using Single Linkage in Hierarchical Clustering

In Single Linkage hierarchical clustering, the distance between two clusters is defined as the minimum distance between any pair of points, one from each cluster.

Given Clusters

  • C1 = {2, 4}
  • C2 = {7, 8}
  • C3 = {12, 14}

Inter-Cluster Distance Calculations

Distance between C1 and C2:

min{|2 − 7|, |2 − 8|, |4 − 7|, |4 − 8|} = min{5, 6, 3, 4} = 3

Distance between C2 and C3:

min{|7 − 12|, |7 − 14|, |8 − 12|, |8 − 14|} = min{5, 7, 4, 6} = 4

Distance between C1 and C3:

min{|2 − 12|, |2 − 14|, |4 − 12|, |4 − 14|} = min{10, 12, 8, 10} = 8

Conclusion

The smallest inter-cluster distance is d(C1, C2) = 3. Therefore, using Single Linkage, the clusters C1 and C2 are merged in the next iteration.

Resulting cluster: {2, 4, 7, 8}
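The distance calculations above can be reproduced with a short Python sketch (the helper name is illustrative):

```python
from itertools import product

def single_linkage(a, b):
    """Single-linkage distance: minimum distance over all cross-cluster pairs."""
    return min(abs(p - q) for p, q in product(a, b))

C1, C2, C3 = [2, 4], [7, 8], [12, 14]
dists = {
    ("C1", "C2"): single_linkage(C1, C2),  # min{5, 6, 3, 4} = 3
    ("C2", "C3"): single_linkage(C2, C3),  # min{5, 7, 4, 6} = 4
    ("C1", "C3"): single_linkage(C1, C3),  # min{10, 12, 8, 10} = 8
}
closest = min(dists, key=dists.get)
print(closest, dists[closest])  # ('C1', 'C2') 3
```

Swapping `min` for `max` in `single_linkage` would give complete linkage instead, which can pick a different pair to merge.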

3.
Which of the following are true of support vector machines? [University of California at Berkeley, CS189: Introduction to Machine Learning, Spring 2019 - Final exam answers]







Correct Answer: A

What does the hyperparameter C mean in SVM?

In a soft-margin Support Vector Machine, the hyperparameter C controls the trade-off between:

  • Maximizing the margin (simpler model)
  • Minimizing classification error on training data

Explanation of each option

Option A — TRUE
Increasing the hyperparameter C penalizes misclassified training points more heavily, forcing the SVM to fit the training data more accurately.
➜ Training error generally decreases.
Option B — FALSE
Hard-margin SVM allows no misclassification and corresponds to C → ∞, not C = 0.
➜ With C = 0, misclassification is not penalized.
Option C — FALSE
Increasing C makes the classifier fit the training data more strictly.
➜ Training error decreases, not increases.
Option D — FALSE
A large C forces the decision boundary to accommodate even outliers.
➜ Sensitivity to outliers increases, not decreases.

Final Answer: Only Option A is true.

Exam Tip: Think of C as the cost of misclassification. High C → low training error but high sensitivity to outliers.
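The role of C is visible directly in the soft-margin objective, ½‖w‖² + C·Σ hinge-loss. A minimal sketch (toy numbers and names are illustrative, not a full SVM solver):

```python
def soft_margin_objective(w, b, X, y, C):
    """Soft-margin SVM objective: 0.5*||w||^2 + C * sum of hinge losses."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge = sum(max(0.0, 1.0 - yi * (sum(wi * xi for wi, xi in zip(w, x)) + b))
                for x, yi in zip(X, y))
    return margin_term + C * hinge

# Toy 1-D data where the last point violates the margin
X, y = [[0.0], [2.0], [1.1]], [-1, 1, -1]
w, b = [1.0], -1.0

# Larger C weights the margin violation more heavily in the objective,
# so the optimizer is pushed to fit that point (lower training error,
# higher sensitivity to outliers).
print(soft_margin_objective(w, b, X, y, C=1))    # violation barely matters
print(soft_margin_objective(w, b, X, y, C=100))  # violation dominates
```

This is why C → ∞ recovers the hard-margin SVM: any violation becomes infinitely costly.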

4.
Which of the following might be valid reasons for preferring an SVM over a neural network? [Indian Institute of Technology Delhi, ELL784: Introduction to Machine Learning, 2017–18 - Exam answers]







Correct Answer: B
Kernel SVMs can implicitly operate in infinite-dimensional feature spaces via the kernel trick, while neural networks have finite-dimensional parameterizations.

Option (b):
An SVM can effectively map the data to an infinite-dimensional space; a neural net cannot.

The key idea here comes from the kernel trick. Kernel-based SVMs (such as those using the RBF kernel) implicitly operate in an infinite-dimensional Hilbert space.

  • This mapping is done implicitly, without explicitly computing features.
  • The number of learned parameters does not grow with the feature space.
  • The optimization problem remains convex, guaranteeing a global optimum.

In contrast, neural networks:

  • Operate in finite-dimensional parameter spaces (finite neurons and weights).
  • Do not truly optimize over an infinite-dimensional feature space.
  • Require explicit architectural growth to approximate higher complexity.

SVMs can work exactly in infinite-dimensional feature spaces via kernels, whereas neural networks can only approximate such mappings using finite architectures.

Why other options are INCORRECT?

  • Option (a) — Incorrect: Neural networks can also learn non-linear transformations through hidden layers and activation functions.
  • Option (c) — Incorrect: Unlike neural networks, SVMs solve a convex optimization problem and do not get stuck in local minima.
  • Option (d) — Incorrect: The implicit feature space created by SVM kernels is typically harder—not easier—to interpret than neural network representations.
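The "infinite-dimensional" claim in option (b) can be made concrete with the RBF kernel: it equals an inner product in an infinite-dimensional feature space, yet is evaluated in closed form. A minimal sketch (gamma and the variable names are illustrative):

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2).

    Equals <phi(x), phi(z)> for an infinite-dimensional feature map phi,
    but is computed in closed form without ever materializing phi.
    """
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = [1.0, 2.0], [2.0, 0.0]
print(rbf_kernel(x, x))  # 1.0: every point has unit norm in feature space
print(rbf_kernel(x, z))  # strictly between 0 and 1, shrinking with distance
```

This is the kernel trick in one line: the model only ever needs kernel values K(x, z), never the coordinates of phi(x) itself.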
5.
Suppose that you are training a neural network for classification, but you notice that the training loss is much lower than the validation loss. Which of the following is the most appropriate way to address this issue? [Stanford University, CS224N: Natural Language Processing with Deep Learning Winter 2018 - Midterm exam answers]






Correct Answer: C

What does "training loss is much lower than the validation loss" mean?

A large gap between training and validation loss is a strong indicator of overfitting, where the model has low bias but high variance.

When the training loss is much lower than the validation loss, it means:

  • The model is learning the training data too well, including noise and minor patterns.
  • It fails to generalize to unseen data (validation set).
  • In other words, the network performs well on seen data but poorly on new data.

Why this happens

  • The model is too complex (too many layers or neurons).
  • Insufficient regularization (e.g., low dropout, weak L2 penalty).
  • Limited training data to learn generalized patterns.
  • Training for too many epochs, allowing memorization of the training set.

Explanation: Why option C is correct?
A much lower training loss compared to validation loss indicates overfitting. Increasing the L2 regularization weight penalizes large model weights, discourages overly complex decision boundaries, and improves generalization to unseen data.

Why the other options are incorrect

  • Option A — Incorrect: Decreasing dropout reduces regularization and typically worsens overfitting.
  • Option B — Incorrect: Increasing hidden layer size increases model capacity, making overfitting more likely.
  • Option D — Incorrect: Adding more layers increases complexity and usually amplifies overfitting.

Note: When training loss ≪ validation loss, think regularization, simpler models, or more data.