Advanced Word2Vec MCQs with Answers (Skip-gram, SGNS & Softmax)
This page provides 20 advanced multiple-choice questions (MCQs) on Word2Vec covering Skip-gram, CBOW, Negative Sampling (SGNS), Full Softmax, subsampling, PMI matrix factorization, cosine similarity, and embedding theory. These questions are designed for postgraduate students, research scholars, competitive exams, and machine learning interviews.
Topics Covered in These Word2Vec MCQs
- Skip-gram vs CBOW differences
- Full Softmax computational complexity O(|V|)
- Negative Sampling and 3/4-power smoothing of the unigram distribution
- Subsampling of frequent words
- Shifted PMI matrix factorization interpretation of SGNS
- Cosine similarity and embedding geometry
- Static embedding limitations (polysemy problem)
- Effect of window size and dimensionality
Who Should Practice These Questions?
These advanced Word2Vec MCQs are suitable for learners preparing for NLP exams, machine learning viva, university theory exams, research interviews, and technical placements. The explanations emphasize conceptual understanding rather than memorization.
Explanation:
The denominator of the softmax requires summing over the entire vocabulary: Σ_{w ∈ V} exp(v_w · v_c), where v_c is the center-word vector. This is one dot product per vocabulary word, so with a vocabulary of 1 million words every update computes 1 million dot products, giving a time complexity of O(|V|) per training example and making training computationally expensive.
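A minimal sketch of why the full-softmax normalizer is expensive; the vocabulary size, dimension, and random vectors below are hypothetical:

```python
import numpy as np

# Hypothetical sizes: 50,000 words, 300-dimensional vectors; a real vocabulary
# of 1 million words scales the same way, just 20x larger.
V, d = 50_000, 300
W_out = np.random.randn(V, d).astype(np.float32)   # context (output) embeddings
v_center = np.random.randn(d).astype(np.float32)   # center-word embedding

# Full softmax: the normalizer needs a dot product with EVERY vocabulary word,
# i.e. O(|V| * d) work for a single training example.
scores = W_out @ v_center                # |V| dot products
probs = np.exp(scores - scores.max())    # subtract the max for numerical stability
probs /= probs.sum()
```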
Explanation:
If the vectors are orthogonal, their dot product is zero and sigmoid(0) = 0.5, so the predicted probability sits at chance level. The gradient is therefore nonzero, and the model still updates the vectors to push the negative sample away from the center word.
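A small numerical check of this point (the vectors are made up for illustration): with orthogonal vectors the score is 0, σ(0) = 0.5, and the negative-sampling gradient for that pair is nonzero.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v_center = np.array([1.0, 0.0, 0.0])   # hypothetical center-word vector
u_neg    = np.array([0.0, 1.0, 0.0])   # orthogonal negative-sample vector

score = v_center @ u_neg                # 0.0 for orthogonal vectors
p = sigmoid(score)                      # 0.5

# SGNS treats a negative sample as label 0, so the gradient of its loss term
# w.r.t. u_neg is sigmoid(score) * v_center — nonzero, pushing u_neg away.
grad_u_neg = p * v_center
print(score, p, grad_u_neg)             # 0.0 0.5 [0.5 0.  0. ]
```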
Explanation:
The subsampling rule discards word w with probability P(w) = 1 − √(t / f(w)), where f(w) is the word's relative frequency and t is a small threshold (typically around 10⁻⁵). When f(w) is very high, t/f(w) becomes very small and the discard probability approaches 1, so very frequent words like "the" are removed most of the time.
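A short sketch of that rule; the relative frequencies below are invented for illustration:

```python
import math

def discard_prob(freq, t=1e-5):
    """Probability of dropping a word with relative frequency `freq`."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

# Hypothetical relative frequencies.
for word, f in [("the", 0.05), ("model", 0.001), ("zeolite", 0.00001)]:
    print(word, round(discard_prob(f), 4))
# "the" is discarded ~98.6% of the time, "model" ~90%, and the rare word never.
```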
Explanation:
The input matrix W represents center-word embeddings and captures semantic structure. The output matrix W' represents context embeddings and is usually discarded after training.
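A minimal sketch of the two matrices a Skip-gram model keeps during training; the sizes and variable names are illustrative, and only W is normally kept as the final embeddings.

```python
import numpy as np

V, d = 10_000, 100                        # hypothetical vocabulary size and dimension
W     = np.random.randn(V, d) * 0.01      # input / center-word embeddings (kept)
W_out = np.random.randn(V, d) * 0.01      # output / context embeddings (usually discarded)

center_id, context_id = 42, 7             # arbitrary word indices
score = W[center_id] @ W_out[context_id]  # dot product used in the training objective

# After training, lookups use only W:
embedding_of_word_42 = W[42]
```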
Explanation:
According to the distributional hypothesis, words appearing in similar contexts obtain similar embeddings, resulting in high cosine similarity.
Explanation:
Skip-gram generates more training signals per word and performs better for rare words, while CBOW is generally faster and smoother for frequent words.
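A rough sketch of how the two architectures slice the same window into training examples; the sentence and window size are just illustrative:

```python
sentence = "the cat sat on the mat".split()
window = 2

skipgram_pairs = []   # (center, context): one example per context word
cbow_examples = []    # (context words, center): one example per position

for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    skipgram_pairs.extend((center, c) for c in context)
    cbow_examples.append((context, center))

print(len(skipgram_pairs), "skip-gram examples vs", len(cbow_examples), "CBOW examples")
# Skip-gram yields several updates per position, which helps rare words;
# CBOW averages each context into a single prediction, which is faster.
```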
Explanation:
Raising unigram frequencies to the power 3/4 reduces the dominance of very frequent words and increases the sampling rate of medium-frequency words, improving embedding quality.
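A small illustration of this 3/4-power smoothing of the negative-sampling distribution; the word counts are made up:

```python
import numpy as np

counts = np.array([1_000_000, 10_000, 100], dtype=float)   # hypothetical word counts

unigram = counts / counts.sum()
smoothed = counts ** 0.75
smoothed /= smoothed.sum()

print(np.round(unigram, 4))    # [0.99   0.0099 0.0001] -> the frequent word dominates
print(np.round(smoothed, 4))   # medium- and low-frequency words get a larger share
```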
Explanation:
Word2Vec embeddings capture linear semantic relationships, allowing vector arithmetic to represent analogies like gender direction in embedding space.
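A toy sketch of analogy arithmetic with hypothetical 2-D vectors chosen so that a "gender" direction exists; real embeddings are much higher-dimensional, but the idea is the same: king − man + woman lands nearest to queen.

```python
import numpy as np

# Hypothetical 2-D embeddings (first axis: "royalty", second axis: "gender").
vecs = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.2]),
    "apple": np.array([0.5, 0.5]),   # unrelated distractor
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(vecs[w], target))
print(best)   # queen
```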
Explanation:
Full softmax requires computation across entire vocabulary for each update, making complexity proportional to vocabulary size and thus very slow.
Explanation:
Classic Word2Vec learns a single static vector representation for each word type, regardless of context. Therefore, polysemous words like “bank” (river bank vs financial bank) receive only one embedding and cannot represent different meanings based on context.
Explanation:
Research shows that Skip-gram with Negative Sampling implicitly factorizes a shifted PMI matrix. This explains why semantic similarity emerges geometrically in Word2Vec embeddings.
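A condensed sketch of that view (Levy & Goldberg, 2014): build a word–context PMI matrix, subtract log k (the number of negative samples), clip at zero, and factorize it; the tiny co-occurrence counts here are invented.

```python
import numpy as np

# Hypothetical word-context co-occurrence counts (rows: words, columns: contexts).
C = np.array([[8., 2., 0.],
              [2., 8., 1.],
              [0., 1., 8.]])
k = 5                                          # number of negative samples in SGNS

total = C.sum()
Pw  = C.sum(axis=1, keepdims=True) / total     # P(w)
Pc  = C.sum(axis=0, keepdims=True) / total     # P(c)
Pwc = C / total                                # P(w, c)

with np.errstate(divide="ignore"):
    pmi = np.log(Pwc / (Pw * Pc))
shifted_pmi = np.maximum(pmi - np.log(k), 0)   # positive shifted PMI (SPPMI)

# Low-rank factorization: SGNS embeddings behave like factors of this matrix.
U, S, Vt = np.linalg.svd(shifted_pmi)
word_vectors = U[:, :2] * np.sqrt(S[:2])       # rank-2 word embeddings
```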
Explanation:
Small window sizes focus on syntactic relationships, while larger window sizes capture broader topical and semantic relationships across sentences.
Explanation:
Frequent words receive many gradient updates during training, which often leads to larger embedding magnitudes compared to rare words.
Explanation:
Increasing k improves the approximation to the full-softmax objective and may enhance embedding quality, but training time increases linearly with k.
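A minimal sketch of the SGNS loss for one (center, context) pair with k negative samples; the vectors and the value of k are illustrative. Each negative sample adds one dot product and one gradient term, which is why the cost grows linearly with k.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 50, 10                                  # hypothetical dimension and negative count

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v_center = rng.normal(size=d)                  # center-word vector
u_pos    = rng.normal(size=d)                  # true context vector
U_neg    = rng.normal(size=(k, d))             # k negative context vectors

# SGNS objective for this pair:
#   maximize log sigma(u_pos . v_center) + sum_i log sigma(-u_neg_i . v_center)
loss = -np.log(sigmoid(u_pos @ v_center)) \
       - np.sum(np.log(sigmoid(-(U_neg @ v_center))))
print(loss)   # one positive term plus k negative terms -> O(k * d) work per pair
```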
Explanation:
Cosine similarity is mathematically symmetric: cos(a, b) equals cos(b, a). This property is independent of the training architecture.
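A two-line check of the symmetry property with arbitrary vectors:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a, b = np.array([0.3, 0.7, -0.2]), np.array([1.0, 0.1, 0.4])
print(np.isclose(cosine(a, b), cosine(b, a)))   # True, regardless of how a and b were trained
```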
Explanation:
The dot product measures alignment between vectors. Higher dot product increases predicted probability that two words co-occur.
Explanation:
Rare words receive very few updates, so their embeddings are often poorly trained and unstable compared to frequent words.
Explanation:
Without subsampling, high-frequency words appear in nearly every context and dominate gradient updates, harming semantic representation learning.
Explanation:
Higher dimensional embeddings increase representational capacity but also computational cost and risk of overfitting, especially with limited data.
Explanation:
Word2Vec learns embeddings based on global co-occurrence patterns throughout the corpus, not on individual sentence position.