
Sunday, January 25, 2026

Shallow Parsing in NLP – Top 10 MCQs with Answers (Chunking)

✔ Scroll down and test yourself — answers are hidden under the “View Answer” button.

🚨 Quiz Instructions:
Attempt all questions first.
✔️ Click SUBMIT at the end to unlock VIEW ANSWER buttons.

Introduction

Shallow parsing, also known as chunking, is a foundational technique in Natural Language Processing (NLP) that focuses on identifying flat, non-recursive phrase structures such as noun phrases, verb phrases, and prepositional phrases from POS-tagged text. Unlike deep parsing, which attempts to build complete syntactic trees, shallow parsing prioritizes efficiency, robustness, and scalability, making it a preferred choice in large-scale NLP pipelines.

This MCQ set is designed to test both conceptual understanding and implementation-level knowledge of shallow parsing. The questions cover key aspects including design philosophy, chunk properties, finite-state models (FSA and FST), BIO tagging schemes, and statistical sequence labeling approaches such as Conditional Random Fields (CRFs). These questions are particularly useful for students studying NLP, Computational Linguistics, Information Retrieval, and AI, as well as for exam preparation and interview revision.

Try to reason through each question before revealing the answer to strengthen your understanding of how shallow parsing operates in theory and practice.


1.
Which statement best captures the primary design philosophy of shallow parsing?






Correct Answer: C

Shallow parsing trades depth and linguistic completeness for efficiency and robustness.

Shallow parsing (chunking) is designed to identify basic phrases such as noun phrases (NP) and verb phrases (VP), to avoid recursion and nesting, and to keep the analysis fast, simple, and robust.

Because of this design choice, shallow parsing scales well to large corpora, works better with noisy or imperfect POS tagging, and is practical for real-world NLP pipelines (IR, IE, preprocessing).

2.
Why is shallow parsing preferred over deep parsing in large-scale NLP pipelines?






Correct Answer: C

Shallow parsing is preferred over deep parsing because it is computationally faster and more robust to noise while providing sufficient structural information for many NLP tasks.

Shallow parsing is preferred mainly because it is faster, simpler, and more robust, especially in real-world NLP systems. The main reasons are:

  • Computational efficiency: shallow parsing works with local patterns over POS tags and avoids building full syntactic trees, so it is much faster and uses less memory than deep parsing.
  • Robustness to noisy data: shallow parsing tolerates errors because it matches short, local tag sequences.
  • Scalability: it is suitable for large-scale text processing.
  • Lower resource requirements: shallow parsing can be implemented with finite-state automata, regular expressions, and sequence labeling models (e.g., CRFs).

For more information, visit

Shallow parsing (chunking) vs. Deep parsing

3.
The phrase patterns used in shallow parsing are most appropriately modeled as:






Correct Answer: B

Phrase patterns in shallow parsing are best modeled as regular expressions / regular languages because chunking is local, linear, non-recursive, and non-overlapping. All of these properties fit exactly within the expressive power of regular languages.

Why are the phrase patterns used in shallow parsing modeled as regular expressions / regular languages?

1. Shallow parsing works on POS tag sequences, not full syntax. In chunking, we usually operate on sequences like "DT JJ JJ NN VBZ DT NN" and define patterns such as "NP → DT? JJ* NN+". This is pattern matching over a flat sequence, not hierarchical structure building. That is exactly what regular expressions are designed for.

2. Chunk patterns are non-recursive. Regular languages cannot express recursion, and shallow parsing intentionally avoids recursion (no nested constituents). For example, "[NP the [NP quick brown fox]]" is not allowed in shallow parsing.

3. Chunks are non-overlapping. Each word belongs to at most one chunk. Example: "[NP the dog] [VP chased] [NP the cat]". There is no crossing or embedding like: "*[NP the dog chased] [NP the cat]". This strict linear segmentation matches the finite-state assumption. Since recursion is forbidden by design, CFG power is unnecessary.
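To make this concrete, here is a minimal sketch (the tag sequence and pattern are illustrative, not from this post) that expresses NP → DT? JJ* NN+ with Python's re module over an encoded POS-tag sequence:

import re

# POS tags of: "the quick brown fox chased a cat"
tags = ["DT", "JJ", "JJ", "NN", "VBD", "DT", "NN"]

# Encode the sequence as "<DT><JJ>..." so every tag is a clearly delimited token.
encoded = "".join(f"<{t}>" for t in tags)

# NP -> DT? JJ* NN+ written as a regular expression over the encoded tags.
np_pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NN>)+")

for m in np_pattern.finditer(encoded):
    print("NP chunk over tags:", m.group())
# NP chunk over tags: <DT><JJ><JJ><NN>
# NP chunk over tags: <DT><NN>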

4.
Which automaton is suitable for recognizing chunk patterns in rule-based shallow parsing over POS-tagged text?






Correct Answer: B

Why is a deterministic finite-state automaton (FSA) suitable for recognizing chunk patterns in rule-based shallow parsing over POS-tagged text?

Chunk patterns in shallow parsing are regular and flat, so they can be efficiently recognized using a finite state automaton.

In rule-based shallow parsing (chunking), the goal is to recognize flat phrase patterns (such as noun phrases or verb phrases) in a linear sequence of POS tags, for example "DT JJ NN VBZ DT NN".

Chunk patterns are defined using regular expressions like "NP → DT? JJ* NN+".

Such patterns belong to the class of regular languages, which can be recognized by a finite state automaton (FSA). Therefore, a deterministic finite state automaton (FSA) is suitable for recognizing chunk patterns in rule-based shallow parsing. More powerful automata like pushdown automata or Turing machines are unnecessary because shallow parsing does not require recursion or unbounded memory.
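In practice, such rule-based chunkers compile the regular-expression grammar into a finite-state matcher. A short sketch using NLTK's RegexpParser (assuming NLTK is installed; the grammar and sentence are illustrative):

import nltk

# One chunk rule: NP -> optional determiner, any number of adjectives, one or more nouns.
grammar = "NP: {<DT>?<JJ>*<NN>+}"
chunker = nltk.RegexpParser(grammar)

# Already POS-tagged input as (word, tag) pairs.
sentence = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"),
            ("fox", "NN"), ("chased", "VBD"), ("the", "DT"), ("cat", "NN")]

tree = chunker.parse(sentence)
print(tree)
# Roughly: (S (NP the/DT quick/JJ brown/JJ fox/NN) chased/VBD (NP the/DT cat/NN))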

5.
Why are finite-state transducers (FSTs) sometimes preferred over FSAs in shallow parsing?






Correct Answer: B

Finite-state transducers (FSTs) are sometimes preferred over finite-state automata (FSAs) in shallow parsing because they can both recognize patterns and produce output labels, whereas FSAs can only recognize whether a pattern matches.

In shallow parsing, the task is not just to detect that a sequence of POS tags forms a chunk, but also to label the chunk boundaries, such as assigning NP, VP, or BIO tags (B-NP, I-NP, O). An FST maps an input POS-tag sequence to an output sequence with chunk labels or brackets, making it well suited for this purpose.

Since shallow parsing involves flat, non-recursive, and local patterns, the power of finite-state models is sufficient. Using an FST adds practical usefulness by enabling annotation and transformation, while retaining the efficiency and simplicity of finite-state processing.
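As a hedged illustration of this recognize-and-annotate behaviour, the tiny hand-written transducer below (my own sketch, not from the post) reads POS tags and emits BIO chunk labels at the same time:

NP_TAGS = {"DT", "JJ", "NN"}   # tags allowed inside our toy NP chunks

def chunk_transduce(pos_tags):
    """Map a POS-tag sequence to BIO labels with a two-state machine.

    States: "OUT" (outside any NP) and "IN" (inside an NP). The output at each
    step depends only on the current state and input tag, which is exactly the
    behaviour of a finite-state transducer.
    """
    state, output = "OUT", []
    for tag in pos_tags:
        if tag not in NP_TAGS:
            output.append("O")
            state = "OUT"
        elif state == "OUT" or tag == "DT":   # a determiner always opens a new NP
            output.append("B-NP")
            state = "IN"
        else:
            output.append("I-NP")
    return output

print(chunk_transduce(["DT", "JJ", "NN", "VBD", "DT", "NN"]))
# ['B-NP', 'I-NP', 'I-NP', 'O', 'B-NP', 'I-NP']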

6.
In the BIO chunk tagging scheme, the tag B-NP indicates:






Correct Answer: B

BIO chunk tagging scheme in shallow parsing - short notes

The BIO chunk tagging scheme is a commonly used method in shallow parsing (chunking) to label phrase boundaries in a sequence of tokens.

BIO stands for:

  • B (Begin) – marks the first word of a chunk
  • I (Inside) – marks words inside the same chunk
  • O (Outside) – marks words that are not part of any chunk

Each B and I tag is usually combined with a chunk type, such as NP (noun phrase) or VP (verb phrase).

Example:

The   quick  brown  fox   jumps
B-NP  I-NP   I-NP   I-NP  B-VP

The BIO tagging scheme represents flat, non-overlapping chunks, avoids hierarchical or nested structures, and converts chunking into a sequence labeling problem. Due to its simplicity and clarity, it is widely used in rule-based, statistical, and neural-network-based shallow parsing systems.
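Going the other way, converting BIO labels back into chunk spans is a small mechanical step. The helper below is an illustrative sketch (the function name is my own) using the example sentence above:

def bio_to_chunks(tokens, bio_tags):
    """Group tokens into (chunk_type, words) spans from BIO labels."""
    chunks, current = [], None
    for word, tag in zip(tokens, bio_tags):
        if tag.startswith("B-"):                                  # a new chunk starts here
            current = (tag[2:], [word])
            chunks.append(current)
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(word)                               # continue the open chunk
        else:                                                     # "O" or an inconsistent I- tag
            current = None
    return chunks

tokens = ["The", "quick", "brown", "fox", "jumps"]
tags = ["B-NP", "I-NP", "I-NP", "I-NP", "B-VP"]
print(bio_to_chunks(tokens, tags))
# [('NP', ['The', 'quick', 'brown', 'fox']), ('VP', ['jumps'])]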

7.
Which property must hold for chunks produced by shallow parsing?






8.
When shallow parsing is formulated as a sequence labeling problem, which probabilistic model is commonly used?






Correct Answer: C

What is Conditional Random Field (CRF)?

A CRF (Conditional Random Field) is a probabilistic, discriminative model used for sequence labeling tasks in machine learning and natural language processing.

A Conditional Random Field models the probability of a label sequence given an input sequence, i.e., P(Y | X), where X is the observation sequence and Y is the corresponding label sequence.

What are CRFs used for?

CRFs are commonly used in NLP tasks such as shallow parsing (chunking), named entity recognition (NER), part-of-speech tagging, and information extraction.

Why is a CRF used for shallow parsing?

Conditional Random Fields (CRFs) are used for shallow parsing because shallow parsing is naturally a sequence labeling problem, and CRFs are designed to model dependencies between neighboring labels in a sequence.
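On the implementation side, here is a minimal sketch of CRF-based chunking, assuming the third-party sklearn-crfsuite package is installed; the features and the two-sentence training "corpus" are purely illustrative:

import sklearn_crfsuite

def token_features(sent, i):
    """Simple per-token features: the word, its POS tag, and the previous POS tag."""
    word, pos = sent[i]
    return {
        "word.lower": word.lower(),
        "pos": pos,
        "prev_pos": sent[i - 1][1] if i > 0 else "BOS",
    }

# Tiny illustrative training data: (word, POS) pairs with BIO chunk labels.
train_sents = [
    [("the", "DT"), ("dog", "NN"), ("chased", "VBD"), ("the", "DT"), ("cat", "NN")],
    [("a", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ")],
]
train_labels = [
    ["B-NP", "I-NP", "O", "B-NP", "I-NP"],
    ["B-NP", "I-NP", "I-NP", "O"],
]

X_train = [[token_features(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X_train, train_labels)

test = [("the", "DT"), ("brown", "JJ"), ("dog", "NN"), ("sleeps", "VBZ")]
print(crf.predict([[token_features(test, i) for i in range(len(test))]]))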

9.
Shallow parsing is less sensitive to POS tagging errors than deep parsing because:






Correct Answer: C

10.
Which of the following tasks lies just beyond the scope of shallow parsing?






Correct Answer: C

Monday, January 5, 2026

HMM POS Tagging MCQs (Advanced) | Viterbi, Baum-Welch & NLP Concepts

✔ Scroll down and test yourself — answers are hidden under the “View Answer” button.

🚨 Quiz Instructions:
Attempt all questions first.
✔️ Click SUBMIT at the end to unlock VIEW ANSWER buttons.
20. If transitions are uniform/random, HMM POS tagger becomes:






Correct Answer: C

With uniform transitions, tagging depends only on P(word|tag), i.e., emission probabilities.

With uniform transitions, an HMM POS tagger reduces to a unigram model that tags each word independently using emission probabilities only.

Step-by-step Explanation

An HMM POS tagger assigns part-of-speech tags using two probabilities:

  1. Transition probability - P(ti | ti−1)
    → How likely a tag follows the previous tag

  2. Emission probability - P(wi | ti)
    → How likely a word is generated by a tag

During decoding with the Viterbi algorithm, the model maximizes the product over all positions i of:

P(ti | ti−1) × P(wi | ti)

What does uniform / random transitions mean?

Uniform transitions imply:

P(ti | ti−1) = constant for all tag pairs

  • Transition probabilities do not prefer any particular tag sequence
  • They contribute the same value for every possible path

Therefore, transition probabilities no longer influence the tagging decision.

What remains?

Only the emission probabilities matter:

arg max over ti of P(wi | ti)

This is exactly what a unigram POS tagger does:

  • Assigns each word the tag with the highest emission probability
  • Ignores contextual information entirely
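A small sketch of this reduction (the emission table is invented for illustration): with uniform transitions, decoding collapses to an independent arg max over emission probabilities for each word.

# Toy emission table P(word | tag); all values are made up for illustration.
emission = {
    "NN": {"time": 0.4, "flies": 0.3},
    "VB": {"time": 0.1, "flies": 0.5},
}

def unigram_tag(words):
    """With uniform transitions, pick arg max_t P(word | t) independently per word."""
    return [max(emission, key=lambda tag: emission[tag].get(w, 0.0)) for w in words]

print(unigram_tag(["time", "flies"]))   # ['NN', 'VB']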
21. Unknown word tagging accuracy is highest when model learns:






Correct Answer: A

This question is about how POS taggers handle unknown (out-of-vocabulary) words—words that were not seen during training.

In Part-of-Speech (POS) tagging, unknown words present a fundamental challenge—they don't appear in the training corpus, so the model cannot rely on learned word-to-tag associations. The solution lies in morphological features, particularly prefix and suffix distributions linked to grammatical categories. Morphological cues like -ly, -ness, -tion strongly correlate with POS tags.


Prefix/suffix distribution per POS

Why? Many parts of speech follow strong morphological patterns:
  • -tion, -ness → Noun
  • -ly → Adverb
  • -ing, -ed → Verb
  • un-, re-, pre- → Verbs / Adjectives
By learning which prefixes and suffixes are likely for each POS, the model can:
  • Infer the POS of new (unknown) words it has never seen
This is the most effective and widely used approach in POS tagging models such as HMMs, CRFs, and neural taggers.

Therefore, unknown word tagging accuracy is highest when the model learns prefix/suffix distributions per POS.
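A rough sketch of the idea (the suffix list and tags are illustrative, not a trained model):

# Illustrative suffix -> most likely POS tag mapping, as might be learned
# from suffix distributions in training data.
SUFFIX_TAGS = [("tion", "NN"), ("ness", "NN"), ("ly", "RB"), ("ing", "VBG"), ("ed", "VBD")]

def guess_unknown_tag(word, default="NN"):
    """Guess a POS tag for an out-of-vocabulary word from its suffix."""
    for suffix, tag in SUFFIX_TAGS:
        if word.lower().endswith(suffix):
            return tag
    return default   # nouns are a common fallback for unknown words

print(guess_unknown_tag("globalization"))  # NN
print(guess_unknown_tag("quickly"))        # RB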
22. Training HMM with labeled POS corpus is:






Correct Answer: B

Both words and tags are known, so probabilities are estimated directly.

23. If only words are available (no tags), HMM must be trained using:






Correct Answer: B

Baum–Welch uses EM to estimate the model parameters (transition and emission probabilities) from unlabeled data.

This question tests your understanding of the three fundamental HMM problems and when to apply each algorithm.

This question concerns the learning problem, one of the three fundamental HMM problems: given only the observation sequence (no tags), we estimate the model parameters with the Forward–Backward (Baum–Welch) algorithm. This is unsupervised learning.

24. Error propagation in Viterbi decoding is more likely when:






Correct Answer: B

A wrong but strong emission can lock Viterbi into an incorrect path.

More information:

The question is about Viterbi decoding. Viterbi decoding is used in POS tagging and other sequence labeling tasks. In these tasks, each tag depends on the previous tag.

What is error propagation?

In Viterbi decoding: The algorithm selects the best path step by step. If a wrong tag is chosen early, that wrong choice affects the next tags. As a result, more errors occur later. This spreading of mistakes is called error propagation.

Why does error propagation happen in the Viterbi algorithm?

Viterbi uses dynamic programming and keeps only one best path, not many alternatives. It depends on transition probabilities (from one tag to the next) and emission probabilities (tag to word). If the model is too confident about a wrong tag, the error continues through the sentence.

Why is option B correct?

For a rare or unknown word, the model may assign a very high probability to one tag and very low probabilities to the others. If that high-probability tag is wrong, Viterbi commits strongly to it and the alternative paths are discarded. As a result, later tags are forced to follow the wrong tag via the transition probabilities.

25. Modern POS taggers outperform HMM mainly because:






Correct Answer: B

Neural models capture global dependencies beyond Markov assumptions.

26. High P(NN | DT) indicates:






Correct Answer: B

Determiner → noun is a common English syntactic pattern.

P(NN | DT) means "the probability that the next tag is a noun (NN), given that the current tag is a determiner (DT)".

A high value indicates that determiners are usually followed by nouns in language data. Examples: "the book", "a pen", "this idea", etc.

27. Sentence probability in an HMM is:






Correct Answer: C

The probability of a single tag path is the product of its transition and emission terms.

Sentence probability in an HMM is the sum, over all possible hidden state sequences that could generate the observed sentence, of each path's probability, where a path's probability is the product of its transition and emission probabilities.

This probabilistic framework is what makes HMMs powerful for sequence modeling in NLP and speech processing—it elegantly handles the uncertainty of hidden states while maintaining computational efficiency through dynamic programming.
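To make "sum over all paths" concrete, here is a small sketch of the forward algorithm for a toy discrete HMM (all probabilities are invented for illustration); it computes exactly this sum with dynamic programming:

import numpy as np

# Toy HMM with tags NN, VB and words "time" (index 0), "flies" (index 1).
pi = np.array([0.6, 0.4])          # P(first tag)
A  = np.array([[0.3, 0.7],         # A[i, j] = P(tag_j | tag_i)
               [0.8, 0.2]])
B  = np.array([[0.4, 0.3],         # B[i, k] = P(word_k | tag_i)
               [0.1, 0.5]])
obs = [0, 1]                       # word indices of "time flies"

def forward(obs, pi, A, B):
    """Sentence probability = sum over all tag paths of (transitions x emissions)."""
    alpha = pi * B[:, obs[0]]           # initialise with the first word
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # extend every partial path by one step
    return alpha.sum()                  # marginalise over the final tag

print(forward(obs, pi, A, B))      # P("time flies") under the toy HMM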

28. Viterbi backpointer table stores:






Correct Answer: B

Backpointers reconstruct the optimal tag sequence.

The Viterbi backpointer table is a table that stores where each best probability came from during Viterbi decoding. In simple words, it remembers which previous state (tag) gave the best path to the current state.

Why do we need a backpointer table?

Viterbi decoding has two main steps: (1) forward pass: compute the best probability of reaching each state at each time step and store these probabilities in the Viterbi table; (2) backtracking: recover the best sequence of tags, for which we must know which state we came from. The backpointer table makes backtracking possible.

Simple Viterbi Example

Sentence: "Time flies"

Possible Tags:

  • NN (Noun)
  • VB (Verb)

Viterbi Probability Table

Time     NN     VB
t = 1    0.3    0.7
t = 2    0.6    0.4

Backpointer Table

Time     NN     VB
t = 2    VB     NN

Interpretation

  • Best path to NN at t = 2 came from VB at t = 1.
  • Best path to VB at t = 2 came from NN at t = 1.
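A compact sketch of Viterbi decoding with an explicit backpointer table (it reuses the toy HMM from the forward-algorithm sketch above; the numbers are independent of the illustrative table in this question):

import numpy as np

tags = ["NN", "VB"]
pi = np.array([0.6, 0.4])          # initial tag probabilities (illustrative)
A  = np.array([[0.3, 0.7],         # A[i, j] = P(tag_j | tag_i)
               [0.8, 0.2]])
B  = np.array([[0.4, 0.3],         # B[i, k] = P(word_k | tag_i)
               [0.1, 0.5]])
obs = [0, 1]                       # word indices of "time flies"

def viterbi(obs, pi, A, B):
    T, N = len(obs), len(pi)
    v = np.zeros((T, N))                   # best path probability ending in each tag
    back = np.zeros((T, N), dtype=int)     # backpointer: index of the best previous tag
    v[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = v[t - 1][:, None] * A * B[:, obs[t]]   # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)                 # remember the best predecessor
        v[t] = scores.max(axis=0)
    # Backtracking: follow backpointers from the best final tag.
    path = [int(v[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [tags[i] for i in reversed(path)]

print(viterbi(obs, pi, A, B))      # most probable tag sequence for "time flies"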
29. Gaussian emission HMMs are preferred in speech tagging because:






Correct Answer: B

Acoustic signals are continuous-valued, well-modeled by Gaussians.

Fundamental Distinction: Discrete vs. Continuous Observations

The choice between discrete and Gaussian (continuous) HMM emission distributions depends entirely on the nature of the observations being modeled.

Discrete HMMs represent observations as discrete symbols from a finite alphabet—such as words in part-of-speech tagging or written text. When observations are discrete, emission probabilities are modeled as categorical distributions over symbol categories.

Continuous (Gaussian) HMMs represent observations as continuous-valued feature vectors. When observations are real-valued, discrete emission probabilities are not applicable; instead, the probability density is modeled using continuous distributions such as Gaussians or Gaussian Mixture Models (GMMs).

Why Gaussian Emission HMMs Fit Speech Data

In speech tagging or speech recognition, the observed data are acoustic features, not words.

Hidden Markov Models require an emission model to represent:

P(observation | state)

  • In text POS tagging → observations are discrete words
  • In speech tagging → observations are continuous feature vectors

Therefore, Gaussian (or Gaussian Mixture) distributions are ideal for modeling continuous acoustic data.

Gaussian-emission HMMs model:

P(xt | statet)

where xt is a continuous acoustic feature vector.
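A tiny sketch of evaluating a Gaussian emission density for one hidden state, assuming SciPy is available (the mean, covariance, and feature vector are invented):

import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 3-dimensional acoustic feature vector observed at time t.
x_t = np.array([1.2, -0.3, 0.8])

# Gaussian emission parameters for one hidden state (invented numbers).
mean = np.array([1.0, 0.0, 1.0])
cov  = np.diag([0.5, 0.5, 0.5])

# P(x_t | state) is a probability *density*, not a categorical probability.
print(multivariate_normal.pdf(x_t, mean=mean, cov=cov))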

30. HMM POS tagging underperforms neural models mainly because it:






Correct Answer: B

HMMs rely on local Markov assumptions, unlike deep contextual models.

HMMs underperform because they rely on short-context Markov assumptions, while neural models capture long-range and global linguistic information.

HMM-based POS taggers rely on the Markov assumption, typically using bigram or trigram tagging. This means the POS tag at position t depends only on a very limited local context.

P(tt | tt−1)   or   P(tt | tt−1, tt−2)

In other words, Hidden Markov Models (HMMs) assume:

  • The current POS tag depends only on a limited number of previous tags (usually one in bigram HMMs, two in trigram HMMs).
  • They cannot look far ahead or far behind in a sentence.

Example:

“The book you gave me yesterday was interesting.”

To correctly tag the word “was”, the model benefits from understanding long-distance syntactic relationships. However, HMMs cannot effectively capture such long-range dependencies.

Why neural models perform better

Modern neural POS taggers such as BiLSTM, Transformer, and BERT:

  • Capture long-range dependencies across the sentence
  • Use bidirectional context (both left and right)
  • Learn character-level and subword features
  • Handle unknown and rare words more effectively

Wednesday, December 17, 2025

Hidden Markov Model (HMM) – MCQs, Notes & Practice Questions | ExploreDatabase

✔ Scroll down and test yourself — answers are hidden under the “View Answer” button.

🚨 Quiz Instructions:
Attempt all questions first.
✔️ Click SUBMIT at the end to unlock VIEW ANSWER buttons.

Key Concepts Illustrated in the Figure

  1. Visible states (Observations)
    Visible states are the observed outputs of an HMM, such as words in a sentence. In the above figure, 'cat', 'purrs', etc. are observations.
  2. Hidden states
    Hidden states are the unobserved underlying states (e.g., POS tags - 'DT', 'N', etc in the figure) that generate the visible observations.
  3. Transition probabilities
    Transition probabilities define the likelihood of moving from one hidden state to another. In the figure, this is represented by the arrows from one POS tag to the other. Example: P(N -> V) or P(V | N).
  4. Emission probabilities
    Emission probabilities define the likelihood of a visible observation being generated by a hidden state. In the figure, this is represented by the arrows from POS tags to words. Example: P(cat | N).
  5. POS tagging using HMM
    POS tagging using HMM models tags as hidden states and words as observations to find the most probable tag sequence.
  6. Evaluation problem
    The evaluation problem computes the probability of an observation sequence given an HMM.
  7. Forward algorithm
    The forward algorithm efficiently solves the evaluation problem using dynamic programming.
  8. Decoding problem
    The decoding problem finds the most probable hidden state sequence for a given observation sequence.
1. In POS tagging using HMM, the hidden states represent:






Correct Answer: B

In HMM-based POS tagging, tags are hidden states and words are observed symbols.

2. The most suitable algorithm for decoding the best POS sequence in HMM tagging is:






Correct Answer: D

Viterbi decoding finds the most probable hidden tag sequence.

3. Transition probabilities in HMM POS tagging define:






Correct Answer: B

Transition probability models tag-to-tag dependency, that is, the probability of a tag at position t given the previous tag at position t−1. It is calculated using Maximum Likelihood Estimation (MLE) as follows:

Maximum Likelihood Estimation (MLE)

When the state sequence is known (for example, in POS tagging with labeled training data), the transition probability is estimated using Maximum Likelihood Estimation.

aij = Count(ti → tj) / Count(ti)

Where:

  • Count(ti → tj) is the number of times a POS tag ti is immediately followed by a POS tag tj in the training data.
  • Count(ti) is the total number of occurrences of tag ti in the entire training data.

This estimation ensures that the transition probabilities for each state sum to 1.

For example, the transition probability P(Noun | Det) is 6/10 = 0.6 if, in the training corpus, the tag sequence "Det Noun" (e.g., "The/Det cat/Noun" in tagged training data) occurs 6 times and the tag "Det" appears 10 times overall.
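A short sketch of this MLE computation on a tiny tagged corpus (the corpus is invented for illustration):

from collections import Counter

# Tiny tagged corpus: one tag sequence per sentence (illustrative).
tag_sequences = [
    ["DT", "NN", "VBZ"],
    ["DT", "JJ", "NN", "VBD", "DT", "NN"],
]

tag_counts = Counter()
bigram_counts = Counter()
for seq in tag_sequences:
    tag_counts.update(seq)
    bigram_counts.update(zip(seq, seq[1:]))     # count ti -> tj pairs

def transition_prob(prev_tag, tag):
    """MLE estimate: Count(prev_tag -> tag) / Count(prev_tag)."""
    return bigram_counts[(prev_tag, tag)] / tag_counts[prev_tag]

print(transition_prob("DT", "NN"))   # 2 occurrences of DT -> NN out of 3 DT tags = 0.666...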

4. Emission probability in POS tagging refers to:






Correct Answer: C

Emission probability is P(word | tag).

It answers the question: "Given a particular POS tag, how likely is it that this tag generates (emits) a specific word?"

Emission probability calculation: out of the total number of times a tag (e.g., NOUN) appears in the training data, count how many times it is assigned to a given word (e.g., "cat/NOUN") and divide.
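Analogously, a sketch of the emission-probability estimate from tagged data (the mini-corpus is invented):

from collections import Counter

# Tagged corpus as (word, tag) pairs (illustrative).
tagged_words = [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ"),
                ("a", "DT"), ("cat", "NN"), ("purrs", "VBZ"), ("dog", "NN")]

tag_counts = Counter(tag for _, tag in tagged_words)
word_tag_counts = Counter(tagged_words)

def emission_prob(word, tag):
    """MLE estimate: Count(word tagged as tag) / Count(tag)."""
    return word_tag_counts[(word, tag)] / tag_counts[tag]

print(emission_prob("cat", "NN"))   # 2 of the 3 NN tokens are "cat" -> 2/3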

5. Which problem does Baum–Welch training solve in HMM POS tagging?






Correct Answer: C

Baum–Welch (EM) learns transition and emission probabilities without labeled data.

Baum–Welch Method

The Baum–Welch method is an algorithm used to train a Hidden Markov Model (HMM) when the true state (tag) sequence is unknown.

What does the Baum–Welch method do?

It estimates (learns) the transition and emission probabilities of an HMM from unlabeled data.

In Simple Terms

  • You are given only observation sequences (e.g., words)
  • You do not know the hidden state sequence (e.g., POS tags)
  • Baum–Welch automatically learns the model parameters that best explain the data

The Baum–Welch method is used to train an HMM by estimating transition and emission probabilities from unlabeled observation sequences using EM. Baum–Welch is a special case of the Expectation–Maximization (EM) algorithm.
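For a sense of what the re-estimation looks like, here is a compact, unscaled Baum–Welch sketch for a single discrete observation sequence (my own illustrative code: random initialisation and no numerical scaling, so it is only suitable for short toy inputs):

import numpy as np

def baum_welch(obs, n_states, n_symbols, n_iter=20, seed=0):
    """Estimate HMM parameters (pi, A, B) from one unlabeled symbol sequence via EM."""
    rng = np.random.default_rng(seed)
    A = rng.random((n_states, n_states)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((n_states, n_symbols)); B /= B.sum(axis=1, keepdims=True)
    pi = np.full(n_states, 1.0 / n_states)
    obs = np.asarray(obs)
    T = len(obs)

    for _ in range(n_iter):
        # E-step: forward and backward probabilities.
        alpha = np.zeros((T, n_states))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta = np.zeros((T, n_states))
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        likelihood = alpha[-1].sum()

        gamma = alpha * beta / likelihood                 # expected state occupancies
        xi = np.zeros((T - 1, n_states, n_states))        # expected transitions
        for t in range(T - 1):
            xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / likelihood

        # M-step: re-estimate parameters from the expected counts.
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(n_symbols):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
    return pi, A, B

# Toy run: 2 hidden states, 3 observation symbols, one short unlabeled sequence.
pi, A, B = baum_welch([0, 1, 2, 1, 0, 2, 1], n_states=2, n_symbols=3)
print(np.round(A, 3))
print(np.round(B, 3))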

6. If an HMM POS tagger has 50 tags and a 20,000-word vocabulary, the emission matrix size is:






Correct Answer: B

Rows correspond to tags and columns to words.

In an HMM POS tagger, the emission matrix represents:

P(word | tag)

So its dimensions are:

  • Rows = number of tags
  • Columns = vocabulary size

Given:

  • Number of tags = 50
  • Vocabulary size = 20,000

Emission matrix size:

50 × 20,000

7. A trigram HMM POS tagger models:






Correct Answer: B

Trigram models capture dependency on two previous tags.

Trigram Model

A trigram tag model assumes that the probability of a tag depends on the previous two tags.

P(ti | ti−1, ti−2)

In POS tagging using an HMM:

  • Transition probabilities are computed using trigrams of tags
  • The model captures more context than unigram or bigram models

Example:

If the previous two tags are DT and NN, the probability of the next tag VB is:

P(VB | DT, NN)

Note: In practice, smoothing and backoff are used because many trigrams are unseen.

8. Data sparsity in emission probabilities mostly occurs due to:






Correct Answer: B

Unseen words lead to zero emission probabilities without smoothing.

Data sparsity in emission probabilities means that many valid word–tag combinations were never seen during training, so their probabilities are zero or unreliable.

Data sparsity may occur due to one or more of the following:
  • Natural language has a very large vocabulary.
  • Training data is finite.
  • New or rare words often appear during test time.
As a result, many words in the test data were never observed with any tag during training.
9. A common solution for unknown words in HMM POS tagging is:






Correct Answer: B

Smoothing assigns non-zero probabilities to unseen events.


Refer here for more information about Laplace smoothing.
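A small sketch of Laplace (add-one) smoothing applied to an emission estimate (the counts and vocabulary size are invented):

# Illustrative counts from training data.
count_word_given_tag = 0        # the word was never seen with tag NN
count_tag = 300                 # total NN tokens in training data
vocab_size = 20000              # number of distinct words

mle = count_word_given_tag / count_tag                           # unsmoothed MLE -> 0.0
laplace = (count_word_given_tag + 1) / (count_tag + vocab_size)  # small but non-zero

print(mle, laplace)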
10. POS tagging is considered a:






Correct Answer: C

Each token is labeled sequentially → classic sequence labeling.

POS Tagging as a Sequence Labeling Task

POS tagging is a sequence labeling task because the goal is to assign a label (POS tag) to each element in a sequence (words in a sentence) while considering their order and context.


What is Sequence Labeling?

In sequence labeling, we:

  • Take an input sequence: w1, w2, …, wn
  • Produce an output label sequence: t1, t2, …, tn

Each input item receives one corresponding label, and the labels are not independent of each other.


POS Tagging as Sequence Labeling

Input sequence → words in a sentence

The / cat / sleeps

Output sequence → POS tags

DT / NN / VBZ

Each word must receive exactly one POS tag, and the choice of tag depends on:

  • The current word (emission probability)
  • The neighboring tags (context / transition probability)