Conceptual MCQs on Transformer and Large Language Model Architectures
These conceptual multiple-choice questions (MCQs) cover core ideas behind Transformer architectures and large language models (LLMs), including BERT, GPT-style models, self-attention, positional embeddings, and in-context learning. Each question is followed by a clear, exam- and interview-oriented explanation.
These questions are useful for machine learning students, NLP researchers, and anyone preparing for deep learning exams or technical interviews.
Topics Covered
- Self-attention vs RNNs
- BERT masking strategy
- GPT scaling behavior
- Positional embeddings
- In-context learning
This question actually tests: "Why can a shallow Transformer understand long documents better than a very deep RNN?"
The key difference lies in how information flows across long distances in a sequence. Transformers handle long-range dependencies efficiently because self-attention allows any token to directly attend to any other token, unlike RNNs where information must propagate sequentially.
What happens in an RNN?
In an RNN (even LSTM/GRU): Information from an early token must pass step by step through every intermediate token. For a document of length n, the dependency path length is O(n).
So for long documents, important early information gets weakened or distorted, and learning long-range dependencies becomes very hard. Even though depth helps, it cannot remove this sequential bottleneck.
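As a rough illustration of this sequential bottleneck, here is a minimal NumPy sketch (toy sizes, no specific RNN library) of how information from the first token must pass through every intermediate hidden state before it can affect the last position:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 16                      # sequence length, hidden size
W_h = rng.normal(0, 0.3, (d, d))    # recurrent weight matrix
W_x = rng.normal(0, 0.3, (d, d))    # input projection
x = rng.normal(size=(n, d))         # token embeddings

h = np.zeros(d)
for t in range(n):                  # information flows strictly left to right
    h = np.tanh(W_h @ h + W_x @ x[t])

# The first token's contribution reaches position n only after n-1 applications
# of tanh(W_h @ ...), so its influence can shrink (or blow up) with distance --
# this is the O(n) dependency path described above.
print("final hidden state norm:", np.linalg.norm(h))
```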
What happens in a Transformer?
With self-attention every token can directly attend to any other token. Dependency path length is O(1) (one attention step). So even with few layers, a word at the start can influence a word at the end immediately. Long-range relationships (coreference, topic continuity, constraints) are preserved.
This is why context window length matters more than depth for long documents.
In simpler terms,
Self-attention's ability to create direct connections between any two tokens in the sequence, regardless of their distance, fundamentally solves the long-range dependency problem that plagues RNNs. In a single attention operation, token i can directly interact with token j, establishing a maximum path length of O(1) between them. This architectural property means that:
- Information flows directly
- Gradients propagate effectively
- Long-range dependencies become learnable
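A minimal NumPy sketch of scaled dot-product attention makes the O(1) path concrete: the full n × n score matrix is computed in one step, so token i can pull information from token j directly, whatever their distance (illustrative only; real models add learned projections, masking, and multiple heads):

```python
import numpy as np

def self_attention(X):
    """Single-head scaled dot-product attention over token embeddings X of shape (n, d)."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # (n, n): every token scores every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ X                               # each output mixes ALL tokens directly

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))                         # 8 tokens, 16-dim embeddings
out = self_attention(X)
print(out.shape)                                     # (8, 16): position 0 already "sees" position 7
```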
Why are the other options INCORRECT?
Option A: INCORRECT. Embeddings are not the main reason — RNNs can also use high-quality embeddings.
Option C: INCORRECT. This applies mainly to BERT, not all Transformers. Also, bidirectionality alone does not solve long dependency paths.
Option D: INCORRECT. Parallelization affects training speed, not the model’s ability to understand long documents.
BERT masks only about 15% of tokens to ensure that most inputs during pretraining resemble real text, thereby reducing the distribution mismatch between pretraining and fine-tuning.
Explanation:
What is meant by distribution shift between pretraining and fine-tuning?
Distribution shift between pretraining and fine-tuning means the kind of inputs the model sees during pretraining are different from what it sees later when we actually use it.
Alternate definition: Distribution shift is the mismatch between the input patterns seen during model training and those encountered during fine-tuning or inference.
Example: During BERT pretraining, inputs contain [MASK] tokens, e.g., "The capital of France is [MASK]". During fine-tuning or real use, the same kind of input appears without [MASK], e.g., "The capital of France is Paris."
Why does the masking strategy help with the distribution shift?
By including unmasked tokens and random tokens mixed in with [MASK] tokens during pretraining, the model builds robustness to inputs without the special masking token. When fine-tuning arrives and the [MASK] token suddenly disappears, the model has already learned patterns that apply to token-level inputs that aren't artificially masked. This softens the domain gap and improves transfer learning performance.
Why does masking only 15% of tokens help?
Masking fewer tokens means: 85% of tokens remain normal. Sentence structure is mostly realistic. [MASK] tokens are rare, not dominant. So the training distribution stays close to the fine-tuning distribution.
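A short sketch of this masking procedure, roughly following the original BERT recipe (about 15% of positions selected; of those, 80% replaced with [MASK], 10% with a random token, 10% left unchanged). The toy vocabulary and token handling here are illustrative, not BERT's actual tokenizer:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Sketch of BERT-style masking: ~15% of positions become prediction targets."""
    random.seed(seed)
    vocab = ["the", "capital", "of", "france", "is", "paris", "dog", "cat"]
    out, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:              # ~15% of tokens are prediction targets
            labels.append(tok)
            r = random.random()
            if r < 0.8:
                out.append("[MASK]")                 # 80%: replace with [MASK]
            elif r < 0.9:
                out.append(random.choice(vocab))     # 10%: replace with a random token
            else:
                out.append(tok)                      # 10%: keep the original token
        else:
            labels.append(None)                      # not a prediction target
            out.append(tok)
    return out, labels

print(mask_tokens("the capital of france is paris".split()))
```

Because 85% of positions pass through untouched, most of what the model reads during pretraining already looks like the unmasked text it will see at fine-tuning time.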
Why are the other options INCORRECT?
Option A: INCORRECT. Masking more or fewer tokens doesn’t meaningfully change training time.
Option B: INCORRECT. Actually, masking more tokens would make copying harder. And, this doesn’t explain why only 15%.
Option D: INCORRECT. Hardness is not the goal; representation quality and transferability are.
Encoder–decoder models are great for input-to-output transformations (e.g., translation, summarization), whereas GPT-style decoder-only models are ideal for pure text generation. As scale increases, the combination of simplicity and training–inference alignment wins, which is why large-scale language generation today is dominated by GPT-style architectures.
GPT-style models are decoder-only and autoregressive.
That means they are trained to do one simple thing: given the previous tokens, predict the next token. This is exactly what they do at inference time when generating text.
- During training: Predict next token using only past tokens (causal / left-to-right attention).
- During inference: Predict next token using only past tokens.
No mismatch between training and generation. This perfect alignment becomes more important as models scale to billions of parameters and massive datasets, which is why GPT-style models scale so well.
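The training–inference alignment comes down to the causal mask. A small NumPy sketch (toy sequence length, random scores) shows how position t is only ever allowed to look at positions up to t, both when computing next-token losses and when generating:

```python
import numpy as np

n = 6                                              # toy sequence length
scores = np.random.default_rng(0).normal(size=(n, n))

# Causal (left-to-right) mask: position t may only attend to positions <= t.
causal_mask = np.tril(np.ones((n, n), dtype=bool))
scores = np.where(causal_mask, scores, -np.inf)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Row t has non-zero weights only for columns 0..t: the model predicts token t+1
# from past tokens alone, during both training and generation.
print(np.round(weights, 2))
```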
Why are the other options INCORRECT?
Option A: They require fewer parameters. INCORRECT. Not true. GPT models often have more parameters than encoder–decoder models like T5. Scaling success is not about being smaller.
Option B: They avoid bidirectional attention. INCORRECT. Avoiding bidirectional attention is a constraint, not the reason for better scaling. Bidirectional attention is powerful for understanding tasks, but it doesn’t help generation.
Option D: They do not use positional embeddings. INCORRECT. They absolutely do use positional information (absolute or rotary positional embeddings). So this option is factually incorrect.
Removing positional embeddings makes a Transformer permutation-invariant, so it cannot model word order.
What are positional embeddings?
Positional embeddings are vectors added to token embeddings to encode the order of tokens in a sequence. Transformers process all tokens in parallel, so unlike RNNs or LSTMs, they have no built-in sense of order. Positional embeddings fix this.
Why they are needed (simple intuition)
Take these two sentences: "dog bites man" and "man bites dog". They have the same words but different meanings. Without positional embeddings, a Transformer sees both as the same bag of words. With positional embeddings, "dog" at position 1 ≠ "dog" at position 3, so order becomes meaningful.
What happens if completely removed?
A Transformer’s self-attention mechanism, by itself, does not know word order.
- Self-attention only looks at token embeddings and similarities between tokens.
- It treats the input as a set, not a sequence.
If you completely remove positional embeddings: The model cannot tell whether the input is "dog bites man" or "man bites dog". Any permutation of tokens produces the same attention pattern. So the model becomes permutation-invariant (order doesn’t matter).
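A quick NumPy check of this claim (a sketch, not any particular model): shuffling the input rows of a plain self-attention layer simply shuffles the output rows in the same way, so nothing in the representation depends on order; once positional embeddings are added, reordering the tokens changes the result.

```python
import numpy as np

def attention(X):
    scores = X @ X.T / np.sqrt(X.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                      # 5 tokens, no positional information
perm = rng.permutation(5)

# Without positions: permuting the inputs just permutes the outputs identically,
# so the model has no way to tell the original order apart.
print(np.allclose(attention(X)[perm], attention(X[perm])))        # True

# With positional embeddings added, reordering the tokens while the position
# vectors stay fixed gives a different result: order now matters.
pos = rng.normal(size=(5, 8))                    # toy positional embeddings
print(np.allclose(attention(X + pos)[perm], attention(X[perm] + pos)))  # False
```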
Autoregressive LLMs outperform BERT on commonsense reasoning because they learn world knowledge and multi-step reasoning by continuously predicting future tokens.
What does commonsense reasoning require?
Commonsense reasoning often requires:
- Temporal flow (e.g., “what happens next?”)
- Causal reasoning (e.g., “if this happened, then what follows?”)
- Multi-step inference across several ideas
How Do Large Language Models Acquire Commonsense Knowledge?
Large language models (such as GPT-style models) are typically autoregressive. This means they are trained to repeatedly answer a single core question:
“Given everything so far, what comes next?”
To perform this task accurately, the model must learn patterns that go far beyond individual words. Over time, it acquires:
- How events usually unfold over time
- Cause–effect relationships in the real world
- Everyday facts and common situations
- Multi-step reasoning and inference patterns
Across billions of prediction steps, the model implicitly accumulates commonsense knowledge and learns how to chain ideas together, which is essential for reasoning tasks.
Why Does BERT Struggle More with Commonsense Reasoning?
BERT follows a fundamentally different training strategy:
- It is trained using masked language modeling, where random words are hidden and must be predicted.
- It uses bidirectional context, but focuses on local sentence-level understanding.
- It excels at language understanding tasks such as classification, named entity recognition (NER), and semantic similarity.
- However, it is not trained to generate long sequences or reasoning chains.
Since BERT was not optimized for sequential prediction and reasoning, it typically underperforms autoregressive models on commonsense reasoning tasks.
BERT is trained with masked language modeling for sentence-level understanding, not for sequential generation, making it weaker at multi-step reasoning tasks.
More explanation:
What is multi-step reasoning?
Multi-step reasoning is the ability to arrive at an answer by going through a sequence of intermediate logical steps, where each step depends on the previous one. Instead of jumping straight to the answer, the model (or person) has to chain several inferences together.
LLM Example: Multi-Step Reasoning
Question:
If all neural networks are models, and transformers are neural networks, what are transformers?
Reasoning Steps:
- Neural networks ⊆ models
- Transformers ⊆ neural networks
- Therefore, transformers ⊆ models
Each step builds on the previous one, illustrating how multi-step reasoning combines intermediate inferences to reach a final conclusion.
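The subset chain above can be checked mechanically; a toy Python sketch (with hypothetical model names standing in for each category) just makes the transitivity of the two steps explicit:

```python
# Toy universe: each category is a set of (hypothetical) model names.
transformers = {"BERT", "GPT-2", "T5"}
neural_nets  = transformers | {"LSTM", "CNN"}       # transformers ⊆ neural networks
models       = neural_nets | {"decision tree"}      # neural networks ⊆ models

step1 = neural_nets <= models         # premise 1: neural networks ⊆ models
step2 = transformers <= neural_nets   # premise 2: transformers ⊆ neural networks
conclusion = transformers <= models   # follows only by chaining step1 and step2
print(step1, step2, conclusion)       # True True True
```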
Why is multi-step reasoning crucial for language models?
Multi-step reasoning is crucial for:
- commonsense reasoning
- logical inference
- causal and temporal questions etc.
Autoregressive LLMs are better at this because they generate text step by step, naturally mirroring the reasoning process.
Multiple attention heads allow a Transformer to attend to different types of relationships between tokens simultaneously.
What do multiple attention heads actually do?
In a Transformer, attention decides which other tokens a word should focus on. When we use multiple attention heads, we don’t just repeat the same attention—we let the model look at the sentence in different ways at the same time.
Multi-head attention splits the query, key, and value projections into multiple "heads," each operating in parallel on lower-dimensional subspaces of the input embeddings. This allows each head to specialize in distinct relationships—like syntactic dependencies in one head, semantic patterns in another, or positional cues in yet another—before concatenating and linearly transforming the outputs. A single head would force all relationships into one averaged attention pattern, creating an information bottleneck and limiting expressiveness.
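A compact NumPy sketch of the splitting described above (illustrative dimensions and random weights; real implementations learn the projection matrices and usually add masking, dropout, and residual connections):

```python
import numpy as np

def multi_head_attention(X, num_heads=4):
    """Split d_model into num_heads subspaces, attend in each, then concatenate."""
    n, d_model = X.shape
    d_head = d_model // num_heads
    rng = np.random.default_rng(0)
    W_q, W_k, W_v, W_o = (rng.normal(0, 0.1, (d_model, d_model)) for _ in range(4))

    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Reshape to (num_heads, n, d_head): each head works in its own lower-dim subspace.
    split = lambda M: M.reshape(n, num_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)

    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)    # (heads, n, n)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    heads = w @ Vh                                           # each head has its own attention pattern

    concat = heads.transpose(1, 0, 2).reshape(n, d_model)    # concatenate the heads
    return concat @ W_o                                      # final linear projection

X = np.random.default_rng(1).normal(size=(8, 32))
print(multi_head_attention(X).shape)                         # (8, 32)
```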
LLMs with fixed context windows struggle with long documents because self-attention scales quadratically in compute and memory with sequence length.
What does “fixed context window” mean?
LLMs process text in chunks called context windows (for example, 2k, 8k, 32k tokens). Inside one window, every token attends to every other token using self-attention. That’s powerful — but expensive.
Why does self-attention become a problem for long documents?
In self-attention, each of n tokens compares itself with n other tokens. This creates an n × n attention matrix. So compute cost grows as O(n²) and memory usage also grows as O(n²).
As the document gets longer:
- GPU memory fills up quickly.
- Computation becomes slow or infeasible.
- The model must truncate, slide windows, or summarize instead of reading everything.
This is the core reason long documents are hard.
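A quick back-of-the-envelope sketch of how the attention-score matrix alone grows (float32 scores, one head, one layer; real models multiply this across many heads and layers, plus activations and KV caches):

```python
# Memory for a single n x n float32 attention-score matrix (4 bytes per entry).
for n in (2_000, 8_000, 32_000, 128_000):
    gib = n * n * 4 / 2**30
    print(f"n = {n:>7,} tokens -> {gib:8.2f} GiB for one attention matrix")
# Doubling the context length quadruples both this memory and the n^2 score computations.
```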
Large language models (LLMs) adapt to new tasks through in-context learning, where few-shot examples in the prompt act as contextual signals that guide the model's predictions without altering its fixed parameters. The attention mechanisms process these examples alongside the input, enabling pattern recognition and task generalization during inference.
What’s Actually Happening? / What is in-context learning?
When you give a large language model (LLM) a prompt like:
"Translate English to French:dog → chien; cat → chat; house → "
The model is not learning in the usual machine learning sense. No parameters or weights inside the model are being updated.
Instead, the examples in the prompt act like temporary instructions that influence the model’s next prediction.
During pretraining, the model has learned that:
- Patterns in the recent context matter
- Earlier input–output pairs often define a task
As a result, the model treats the examples as conditioning signals:
- “The task here is translation”
- “The mapping pattern is English → French”
- “I should continue this pattern”
This phenomenon is known as in-context learning.
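A minimal sketch of the few-shot prompt format described above (pure string construction; the model name and generation call depend on whichever inference API you use, so none is shown here):

```python
# Few-shot examples act as conditioning context, not as training data:
# nothing below updates any model weights.
examples = [("dog", "chien"), ("cat", "chat")]
query = "house"

prompt = "Translate English to French:\n"
prompt += "".join(f"{en} → {fr}\n" for en, fr in examples)  # in-context demonstrations
prompt += f"{query} → "                                     # the model continues the pattern

print(prompt)
# A well-trained autoregressive LLM conditioned on this prompt will tend to
# complete it with "maison", having inferred the task from the examples alone.
```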
Transformers are universal sequence-to-sequence approximators because self-attention enables global interaction and feed-forward networks provide nonlinear expressiveness.
Explanation: Self-attention computes contextual mappings, allowing each position to weigh relationships across the entire input sequence dynamically. In simpler terms, self-attention decides what information to gather and from where.
Feed-forward networks (position-wise MLPs) provide non-linear value transformations, enabling approximation of arbitrary functions when stacked with attention. In simpler terms, feed-forward networks decide how to transform that information.
Together, they can represent any mapping from an input sequence to an output sequence, in theory, assuming sufficient depth, width, and data.
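A minimal sketch of one Transformer block showing this division of labour: attention mixes information across positions, and the position-wise feed-forward network applies a nonlinear transformation at each position (NumPy, untrained random weights; layer norm and residual connections omitted for brevity, since they are not the source of expressiveness discussed here):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_ff = 8, 16, 64
W1, b1 = rng.normal(0, 0.1, (d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(0, 0.1, (d_ff, d)), np.zeros(d)

def attention(X):                       # global interaction: gathers information across positions
    s = X @ X.T / np.sqrt(X.shape[-1])
    w = np.exp(s - s.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ X

def feed_forward(X):                    # position-wise nonlinearity: transforms that information
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

X = rng.normal(size=(n, d))
out = feed_forward(attention(X))        # one (simplified) Transformer block
print(out.shape)                        # (8, 16)
```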
Why are the other options INCORRECT?
Option A: Layer normalization. INCORRECT. Helps stabilize and speed up training, but does not increase representational capacity.
Option B: Residual connections. INCORRECT. Improve gradient flow and optimization; they don’t make the model more expressive.
Option D: Tokenization strategy. INCORRECT. A preprocessing choice, not a source of theoretical function-approximation power.