
RNN vs LSTM: 10 MCQs with Answers & Detailed Explanations

Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks are foundational architectures for sequence modeling, used in tasks such as language modeling, time-series prediction, and speech recognition. While both are designed to handle sequential data, they differ significantly in how they capture long-term dependencies and manage gradient flow during training.

This MCQ set presents ten advanced, higher-order thinking questions on RNN vs LSTM — testing conceptual clarity, mathematical intuition, gradient propagation, and architectural reasoning. These questions are ideal for university exams, competitive tests (like GATE / UGC NET), interviews, and anyone preparing deeply for machine learning and deep learning assessments.

What You Will Learn

  • Key architectural differences between vanilla RNN and LSTM networks
  • Why RNNs suffer from vanishing gradients and how LSTM addresses this issue
  • Role of gates and memory cell in preserving long-term information
  • Practical decision-making on when to use RNN vs LSTM

How to Attempt This Quiz

Read each question carefully and try to answer before revealing the solution. Click the “View Answer” button to see the correct choice along with a conceptual explanation designed to strengthen your understanding.


1.
During backpropagation through time (BPTT), the gradient in a vanilla RNN becomes very small for earlier time steps primarily because:






Correct Answer: B

Explanation:

In a vanilla RNN, gradients are propagated backward through multiple time steps during Backpropagation Through Time (BPTT). At each step, the gradient is multiplied by the recurrent weight matrix and the derivative of activation functions like tanh or sigmoid.

Since derivatives of tanh and sigmoid are typically less than 1, repeated multiplication across many time steps causes the gradient to shrink exponentially. This phenomenon is known as the vanishing gradient problem.

Mathematically, gradients contain terms like:

∂L/∂hₜ × W × W × W × ...

If the largest singular value of W is below 1 (and the activation derivatives are themselves at most 1), this product decays exponentially, preventing the network from learning long-range dependencies.
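The decay described above can be sketched numerically. The snippet below uses illustrative values (not a trained network): each backward step multiplies the gradient by a recurrent weight and a tanh derivative, both below 1.

```python
import math

# Minimal numeric sketch of gradient decay during BPTT in a vanilla RNN.
# Values are illustrative only, not taken from a real model.
w = 0.9                      # recurrent weight magnitude, |w| < 1
h = math.tanh(0.5)           # activation; tanh'(x) = 1 - tanh(x)**2 < 1
grad = 1.0                   # gradient at the final time step
norms = []
for _ in range(50):
    grad *= w * (1 - h**2)   # one backward step through time
    norms.append(abs(grad))

print(norms[0], norms[-1])   # the gradient collapses toward zero
```

After 50 steps the gradient has shrunk by roughly eight orders of magnitude, which is exactly the vanishing-gradient behavior the explanation describes.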

2.
Which architectural component in LSTM directly enables constant error flow across long sequences?






Correct Answer: C

Explanation:

The LSTM introduces a separate memory pathway called the cell state (Cₜ). Unlike the RNN hidden state, which is rewritten through a squashing nonlinearity at every step, the cell state is updated additively:

Cₜ = fₜCₜ₋₁ + iₜC̃ₜ

Because of this additive structure, gradients can flow backward without being repeatedly multiplied by small numbers. This creates what is often called the constant error carousel, which preserves long-term information.
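The constant error carousel can be made concrete with made-up gate values: along the cell-state path, the gradient from step T back to step 0 is simply the product of the forget-gate activations, which the network can learn to keep near 1.

```python
import numpy as np

# Sketch (with made-up gate values) of the cell-state update
# Cₜ = fₜCₜ₋₁ + iₜC̃ₜ and of the gradient along the cell path.
rng = np.random.default_rng(0)
T = 100
f = np.full(T, 0.99)              # forget gates learned close to 1
i = rng.random(T) * 0.1           # small input-gate activations
c_tilde = rng.standard_normal(T)  # candidate values

c = 0.0
for t in range(T):
    c = f[t] * c + i[t] * c_tilde[t]   # additive update

grad = float(np.prod(f))   # dC_T/dC_0 along the cell-state path
print(grad)                # 0.99**100 ≈ 0.37: the gradient survives 100 steps
```

Compare this with the RNN sketch above: even after 100 steps, a forget gate near 1 leaves a usable gradient.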

3.
Consider a task where important information appears at the beginning of a 500-word sequence and is required at the end. Which statement is most accurate?






Correct Answer: C

Explanation:

Vanilla RNNs struggle with long-range dependencies due to vanishing gradients. Information from early time steps fades as it propagates through many transformations.

LSTM, however, uses gates (input, forget, output) to regulate information flow. The forget gate decides what to keep, allowing important early information to persist across hundreds of steps.

Thus, for long sequences (like 500 words), LSTM is structurally better suited.

4.
In an LSTM, if the forget gate output is always 1 and input gate is always 0, what happens?






Correct Answer: C

Explanation:

Using the LSTM update equation:

Cₜ = fₜCₜ₋₁ + iₜC̃ₜ

If fₜ = 1 and iₜ = 0:

Cₜ = 1·Cₜ₋₁ + 0·C̃ₜ = Cₜ₋₁

Thus, the previous memory is preserved exactly. No new information is added and none is forgotten.
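The substitution above takes one line of arithmetic to verify; the values below are arbitrary and purely for illustration.

```python
# Plugging fₜ = 1 and iₜ = 0 into the update Cₜ = fₜCₜ₋₁ + iₜC̃ₜ
# (values are arbitrary, for illustration only):
c_prev = 3.14        # memory stored so far
f_t, i_t = 1.0, 0.0  # forget gate fully open, input gate fully closed
c_tilde = 42.0       # candidate value, ignored because i_t = 0

c_t = f_t * c_prev + i_t * c_tilde
print(c_t)           # 3.14: the previous memory is carried forward unchanged
```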

5.
Why does LSTM reduce the vanishing gradient problem compared to RNN?






Correct Answer: C

Explanation:

LSTM reduces vanishing gradients by creating an additive memory path. Instead of forcing gradients through a chain of weight-matrix multiplications and squashing nonlinearities at every step (as in an RNN), it updates the cell state through controlled addition.

Additive updates prevent exponential shrinkage of gradients, enabling better long-term learning.

6.
If sequence length is very small (e.g., 5 time steps), which is generally more computationally efficient?






Correct Answer: B

Explanation:

LSTM contains input, forget, and output gates plus a candidate update, each with its own weight matrices, giving it roughly four times the parameters and per-step computation of a vanilla RNN with the same hidden size.

For short sequences where long-term dependency is not required, a simple RNN is computationally cheaper and sufficient.
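The roughly 4x cost claim can be checked with a back-of-the-envelope parameter count, assuming the standard formulations hₜ = tanh(Wxₜ + Uhₜ₋₁ + b) for the RNN and four such weight blocks (three gates plus the candidate) for the LSTM.

```python
# Rough parameter counts for one recurrent layer, biases included.
def rnn_params(input_size: int, hidden_size: int) -> int:
    # one weight block: W (hidden x input), U (hidden x hidden), bias
    return hidden_size * (input_size + hidden_size) + hidden_size

def lstm_params(input_size: int, hidden_size: int) -> int:
    # four weight blocks: input, forget, output gates + candidate
    return 4 * rnn_params(input_size, hidden_size)

print(rnn_params(128, 256))   # 98560
print(lstm_params(128, 256))  # 394240, about 4x the RNN's cost
```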

7.
Which equation difference is most responsible for LSTM’s long memory capability?






Correct Answer: B

Explanation:

The vanilla RNN overwrites its entire hidden state at every step through a squashing nonlinearity:

hₜ = tanh(Wxₜ + Uhₜ₋₁)

During backpropagation, this chains matrix multiplications and activation derivatives across time, causing gradients to either vanish or explode.

LSTM introduces a fundamentally different update rule:

Cₜ = fₜCₜ₋₁ + iₜ C̃ₜ

This additive memory update allows information to flow across time steps without being repeatedly multiplied. The forget gate (fₜ) controls retention, while the input gate (iₜ) regulates new information. This structural change is the key reason LSTM maintains long-term memory.

8.
An RNN fails to learn dependencies beyond 20 time steps. The most principled solution is:






Correct Answer: B

Explanation:

The inability to learn long dependencies is usually due to the vanishing gradient problem — a structural limitation of vanilla RNNs.

Increasing hidden size may increase capacity but does not solve gradient decay. Increasing epochs only trains longer but does not fix the underlying gradient instability.

Replacing the RNN with LSTM introduces gated memory mechanisms that explicitly preserve information over long sequences. Therefore, switching architectures is the most principled solution.

9.
Which statement best differentiates gradient propagation in RNN vs LSTM?






Correct Answer: A

Explanation:

In vanilla RNNs, gradients propagate through repeated matrix multiplications:

∂L/∂hₜ × U × U × U × ...

This purely multiplicative pathway causes exponential decay or explosion.

In LSTM, the cell state provides an additive path:

Cₜ = fₜCₜ₋₁ + iₜC̃ₜ

Because the gradient along the cell-state path is scaled only by forget-gate activations, which the network can learn to keep close to 1, it avoids the repeated weight-matrix multiplications of the RNN path and flows far more stably across long sequences.
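The contrast between the two backward paths can be shown with illustrative per-step factors (not values from a trained network): the RNN path multiplies factors well below 1, while the LSTM cell path multiplies forget gates that can sit near 1.

```python
import numpy as np

# Illustrative contrast of the two backward paths over T = 100 steps.
T = 100
rnn_factors = np.full(T, 0.8)    # |U| * tanh' per step, well below 1
lstm_forget = np.full(T, 0.98)   # forget gates learned close to 1

rnn_grad = float(np.prod(rnn_factors))   # 0.8**100: vanishes
lstm_grad = float(np.prod(lstm_forget))  # 0.98**100 ≈ 0.13: still usable

print(rnn_grad, lstm_grad)
```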

10.
If all gates in LSTM are removed, the model effectively reduces to:






Correct Answer: C

Explanation:

LSTM extends the vanilla RNN by adding three gates: input, forget, and output gates. These gates control memory flow and prevent vanishing gradients.

If all gating mechanisms are removed, the architecture loses its controlled memory updates and effectively behaves like a standard recurrent neural network with simple hidden state recurrence.

Thus, LSTM without gates collapses into a vanilla RNN.