
Advanced Word2Vec MCQs with Answers (Skip-gram, SGNS & Softmax)

This page provides 20 advanced multiple-choice questions (MCQs) on Word2Vec covering Skip-gram, CBOW, Negative Sampling (SGNS), Full Softmax, subsampling, PMI matrix factorization, cosine similarity, and embedding theory. These questions are designed for postgraduate students, research scholars, competitive exams, and machine learning interviews.

Topics Covered in These Word2Vec MCQs

  • Skip-gram vs CBOW differences
  • Full Softmax computational complexity O(|V|)
  • Negative Sampling and the 3/4 distribution smoothing
  • Subsampling of frequent words
  • Shifted PMI matrix factorization interpretation of SGNS
  • Cosine similarity and embedding geometry
  • Static embedding limitations (polysemy problem)
  • Effect of window size and dimensionality

Who Should Practice These Questions?

These advanced Word2Vec MCQs are suitable for learners preparing for NLP exams, machine learning viva, university theory exams, research interviews, and technical placements. The explanations emphasize conceptual understanding rather than memorization.



1. In Skip-gram with full softmax, what is the primary computational bottleneck when vocabulary size is extremely large (e.g., 1 million words)?

Correct Answer: C

Explanation:

The denominator of the softmax requires summing over all vocabulary words. If vocabulary size is 1 million, 1 million dot products must be computed for every update, making training computationally expensive.

Full softmax requires computing the denominator over the entire vocabulary, Σ_{w ∈ V} exp(v_wᵀ v_c), so the time complexity is O(|V|) per training example.
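To make the O(|V|) cost concrete, here is a minimal Python sketch (with made-up random vectors standing in for trained embeddings, and a toy vocabulary size) of the full-softmax denominator; note that it performs one dot product and one exponential per vocabulary word:

```python
import math
import random

random.seed(0)
V, d = 1000, 8  # toy vocabulary size and embedding dimension

# Random center-word and output vectors (stand-ins for trained embeddings).
center = [random.gauss(0, 0.1) for _ in range(d)]
outputs = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(V)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# The denominator alone touches all |V| output rows: O(|V|) per example.
scores = [dot(row, center) for row in outputs]          # |V| dot products
denominator = sum(math.exp(s) for s in scores)          # |V| exponentials
p_target = math.exp(scores[42]) / denominator           # one word's probability
```

With |V| = 1,000,000 instead of 1,000, the same two loops would run a million times per training example, which is exactly the bottleneck the question describes.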

2. In negative sampling, if a negative word vector is orthogonal to the center word vector, what happens to its gradient update?

Correct Answer: C

Explanation:

If the vectors are orthogonal, their dot product is zero. Since sigmoid(0) = 0.5, the gradient is nonzero: the model still updates the vectors to push the negative sample away from the center word.
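This can be checked with a small sketch using the standard SGNS loss term for one negative sample, -log sigmoid(-u_neg · v_c), whose gradient with respect to u_neg is sigmoid(u_neg · v_c) · v_c (the vectors below are hand-picked toy values, not trained embeddings):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy 3-d vectors: the negative sample is orthogonal to the center word.
center = [1.0, 0.0, 0.0]
negative = [0.0, 1.0, 0.0]

score = sum(a * b for a, b in zip(center, negative))  # 0.0 by construction

# Gradient of -log sigmoid(-u_neg . v_c) w.r.t. u_neg is sigmoid(score) * v_c.
coeff = sigmoid(score)                 # sigmoid(0) = 0.5, not 0
grad_negative = [coeff * c for c in center]
```

Because the gradient coefficient is 0.5 rather than 0, a gradient-descent step still moves the negative vector against the center-word direction.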

3. Given the subsampling probability formula P(w) = 1 - √(t / f(w)), what happens when word frequency f(w) is much larger than t?

Correct Answer: B

Explanation:

When frequency is very high, t/f(w) becomes very small, making the discard probability approach 1. Thus very frequent words like "the" are removed most of the time.
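A short sketch of the formula, using the commonly cited threshold t = 1e-5 and illustrative (not corpus-derived) frequencies:

```python
import math

t = 1e-5  # subsampling threshold commonly used with Word2Vec

def discard_prob(freq):
    # P(w) = 1 - sqrt(t / f(w)); clamp at 0 for rare words where f(w) <= t
    return max(0.0, 1.0 - math.sqrt(t / freq))

p_the = discard_prob(0.05)    # a stop word covering ~5% of tokens
p_rare = discard_prob(1e-6)   # a rare word below the threshold
```

Here p_the comes out above 0.98, so "the" is discarded almost every time it appears, while the rare word is never discarded.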

4. Why does Word2Vec learn two embedding matrices (W and W') but typically use only W after training?

Correct Answer: C

Explanation:

The input matrix W represents center-word embeddings and captures semantic structure. The output matrix W' represents context embeddings and is usually discarded after training.

5. If two words have nearly identical context distributions in a corpus, their Word2Vec embeddings will most likely:

Correct Answer: C

Explanation:

According to the distributional hypothesis, words appearing in similar contexts obtain similar embeddings, resulting in high cosine similarity.

6. Which scenario particularly favors Skip-gram over CBOW?

Correct Answer: C

Explanation:

Skip-gram generates more training signals per word and performs better for rare words, while CBOW is generally faster and smoother for frequent words.

7. Why is the negative sampling distribution raised to the power of 3/4?

Correct Answer: C

Explanation:

Raising frequencies to the power 3/4 reduces dominance of very frequent words and increases medium-frequency sampling, improving embedding quality.
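The effect is easy to verify numerically. This sketch compares the raw unigram distribution with the 3/4-smoothed one on invented toy counts (the words and counts are illustrative only):

```python
# Unigram counts for a toy corpus: one dominant word, some medium, one rare.
counts = {"the": 100000, "bank": 500, "river": 300, "zymurgy": 2}

def sampling_dist(counts, power):
    # Raise each count to `power`, then renormalize into a distribution.
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}

raw = sampling_dist(counts, 1.0)        # plain unigram distribution
smoothed = sampling_dist(counts, 0.75)  # Word2Vec's 3/4-smoothed version
```

In the smoothed distribution, "the" loses probability mass while every less frequent word gains some, which is exactly the rebalancing the explanation describes.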

8. Why does the analogy king - man + woman ≈ queen work in Word2Vec?

Correct Answer: C

Explanation:

Word2Vec embeddings capture linear semantic relationships, allowing vector arithmetic to represent analogies like gender direction in embedding space.
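The arithmetic can be illustrated with hand-built 2-d toy vectors in which one axis encodes "royalty" and the other encodes "gender" as a linear offset (these are constructed for illustration, not learned by any model):

```python
# Toy embeddings: axis 0 = royalty, axis 1 = gender offset.
emb = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

# king - man isolates the "royalty" component; adding woman restores
# the female gender offset, landing exactly on queen in this toy space.
result = add(sub(emb["king"], emb["man"]), emb["woman"])
```

Real trained embeddings only approximate this relationship, which is why the analogy is usually evaluated with nearest-neighbor search rather than exact equality.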

9. If neither negative sampling nor hierarchical softmax is used, training Word2Vec with full softmax becomes:

Correct Answer: C

Explanation:

Full softmax requires a computation across the entire vocabulary for each update, making the complexity proportional to vocabulary size and training therefore very slow.

10. Which limitation is fundamentally unavoidable in static Word2Vec embeddings trained without contextualization?

Correct Answer: B

Explanation:

Classic Word2Vec learns a single static vector representation for each word type, regardless of context. Therefore, polysemous words like “bank” (river bank vs financial bank) receive only one embedding and cannot represent different meanings based on context.

11. Skip-gram with Negative Sampling (SGNS) has been theoretically shown to approximate factorization of which matrix?

Correct Answer: C

Explanation:

Research shows that Skip-gram with Negative Sampling implicitly factorizes a shifted PMI matrix. This explains why semantic similarity emerges geometrically in Word2Vec embeddings.
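The quantity being factorized, PMI(w, c) - log k (where k is the number of negative samples), can be computed directly from co-occurrence counts. The counts below are an invented toy example used only to show the arithmetic:

```python
import math

# Toy (word, context) co-occurrence counts.
counts = {("bank", "money"): 40, ("bank", "river"): 10,
          ("loan", "money"): 30}
total = sum(counts.values())

def pmi(w, c):
    # PMI(w, c) = log( p(w, c) / (p(w) * p(c)) )
    pw = sum(v for (a, _), v in counts.items() if a == w) / total
    pc = sum(v for (_, b), v in counts.items() if b == c) / total
    pwc = counts[(w, c)] / total
    return math.log(pwc / (pw * pc))

k = 5  # number of negative samples
# SGNS implicitly factorizes the matrix of PMI(w, c) - log k entries.
shifted = pmi("bank", "money") - math.log(k)
```

Each cell of the implicit matrix is just the PMI of a word-context pair shifted down by log k, so larger k pushes all entries lower.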

12. Increasing the context window size in Word2Vec primarily encourages the model to capture more:

Correct Answer: C

Explanation:

Small window sizes focus on syntactic relationships, while larger window sizes capture broader topical and semantic relationships across sentences.

13. In trained Word2Vec embeddings, very frequent words often tend to have:

Correct Answer: B

Explanation:

Frequent words receive many gradient updates during training, which often leads to larger embedding magnitudes compared to rare words.

14. If the number of negative samples (k) is significantly increased in Skip-gram with Negative Sampling, what is the most likely effect?

Correct Answer: B

Explanation:

Increasing k improves approximation to full softmax and may enhance embedding quality, but training time increases linearly with k.

15. Is cosine similarity between two Word2Vec embeddings symmetric?

Correct Answer: A

Explanation:

Cosine similarity is mathematically symmetric: cos(a, b) equals cos(b, a). This property is independent of the training architecture.
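A minimal sketch of the symmetry, with arbitrary example vectors; it holds because the dot product in the numerator is commutative:

```python
import math

def cosine(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

u = [0.3, -1.2, 0.7]
v = [1.1, 0.4, -0.5]

# Swapping the arguments gives the identical value.
same = cosine(u, v) == cosine(v, u)
```

A vector's cosine similarity with itself is 1, which is a handy sanity check for any implementation.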

16. Why does Word2Vec use the dot product between word vectors during training?

Correct Answer: B

Explanation:

The dot product measures alignment between vectors. A higher dot product yields a higher predicted probability that the two words co-occur.
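In SGNS the predicted co-occurrence probability is the sigmoid of the dot product, so better-aligned vectors receive higher probability. A sketch with hand-picked toy vectors:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

center = [0.5, 0.5, 0.0]
aligned = [1.0, 1.0, 0.0]      # points the same way as center
opposed = [-1.0, -1.0, 0.0]    # points the opposite way

# sigmoid(dot) is the model's predicted co-occurrence probability.
p_aligned = sigmoid(dot(center, aligned))   # above 0.5
p_opposed = sigmoid(dot(center, opposed))   # below 0.5
```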

17. Very rare words in Word2Vec training tend to have:

Correct Answer: B

Explanation:

Rare words receive very few updates, so their embeddings are often poorly trained and unstable compared to frequent words.

18. If subsampling of frequent words is completely removed, what is the most likely outcome?

Correct Answer: C

Explanation:

Without subsampling, high-frequency words appear in nearly every context and dominate gradient updates, harming semantic representation learning.

19. If embedding dimensionality increases significantly (e.g., from 100 to 1000), what is the most likely effect?

Correct Answer: C

Explanation:

Higher dimensional embeddings increase representational capacity but also computational cost and risk of overfitting, especially with limited data.

20. In Word2Vec, a word’s embedding primarily reflects:

Correct Answer: B

Explanation:

Word2Vec learns embeddings based on global co-occurrence patterns throughout the corpus, not on individual sentence position.