| Explore Database

In Skip-gram with full softmax, what is the primary computational bottleneck when vocabulary size is extremely large (e.g., 1 million words)?

A. Computing dot product between two vectors
B. Updating embedding matrices
C. Computing the normalization term over entire vocabulary
D. Gradient vanishing

Correct Answer: C

Explanation:

The denominator of the softmax requires summing over all vocabulary words. If vocabulary size is 1 million, 1 million dot products must be computed for every update, making training computationally expensive.

Full softmax requires computing the denominator over the entire vocabulary: Σ exp(v_w^T v_{w_c}) Time complexity = O(|V|) per training example.

In negative sampling, if a negative word vector is orthogonal to the center word vector, what happens to its gradient update?

A. Very large update
B. No update
C. Small but non-zero update
D. Random update

Correct Answer: C

Explanation:

If vectors are orthogonal, their dot product is zero. Since sigmoid(0) = 0.5, the gradient is small but not zero. The model still updates the vectors to push negative samples away.

Given the subsampling probability formula P(w) = 1 - √(t / f(w)), what happens when word frequency f(w) is much larger than t?

A. Word is always kept
B. Word is almost always discarded
C. No effect on sampling
D. Word weight increases

Correct Answer: B

Explanation:

When frequency is very high, t/f(w) becomes very small, making the discard probability approach 1. Thus very frequent words like "the" are removed most of the time.

Why does Word2Vec learn two embedding matrices (W and W') but typically use only W after training?

A. W' is random
B. W' contains no semantic information
C. W captures center-word semantic representation
D. W' cannot be stored

Correct Answer: C

Explanation:

The input matrix W represents center-word embeddings and captures semantic structure. The output matrix W' represents context embeddings and is usually discarded after training.

If two words have nearly identical context distributions in a corpus, their Word2Vec embeddings will most likely:

A. Be orthogonal
B. Be identical in all dimensions
C. Have high cosine similarity
D. Be randomly distributed

Correct Answer: C

Explanation:

According to the distributional hypothesis, words appearing in similar contexts obtain similar embeddings, resulting in high cosine similarity.

Which scenario particularly favors Skip-gram over CBOW?

A. Small corpus size
B. Faster training requirement
C. Learning better representations for rare words
D. Reduced memory usage

Correct Answer: C

Explanation:

Skip-gram generates more training signals per word and performs better for rare words, while CBOW is generally faster and smoother for frequent words.

Why is the negative sampling distribution raised to the power of 3/4?

A. To increase probability of rare words only
B. To eliminate frequent words
C. To smooth and flatten the unigram distribution
D. To enforce Zipf’s law strictly

Correct Answer: C

Explanation:

Raising frequencies to the power 3/4 reduces dominance of very frequent words and increases medium-frequency sampling, improving embedding quality.

Why does the analogy king - man + woman ≈ queen work in Word2Vec?

A. Model memorizes analogy pairs
B. Model encodes explicit gender labels
C. Linear semantic directions emerge in embedding space
D. Rule-based post-processing

Correct Answer: C

Explanation:

Word2Vec embeddings capture linear semantic relationships, allowing vector arithmetic to represent analogies like gender direction in embedding space.

If neither negative sampling nor hierarchical softmax is used, training Word2Vec with full softmax becomes:

A. Unstable
B. Impossible
C. Computationally infeasible at scale due to O(|V|) complexity
D. Semantically incorrect

Correct Answer: C

Explanation:

Full softmax requires computation across entire vocabulary for each update, making complexity proportional to vocabulary size and thus very slow.

10.

Which limitation is fundamentally unavoidable in static Word2Vec embeddings trained without contextualization?

A. Inability to compute cosine similarity
B. Inability to represent polysemous words with different context-dependent meanings
C. Inability to optimize via gradient descent
D. Inability to scale to large corpora

Correct Answer: B

Explanation:

Classic Word2Vec learns a single static vector representation for each word type, regardless of context. Therefore, polysemous words like “bank” (river bank vs financial bank) receive only one embedding and cannot represent different meanings based on context.

Major links

Quicklinks

Sunday, March 1, 2026

No comments:

Post a Comment

Featured Content

Multiple choice questions in Natural Language Processing Home

All time most popular contents

Report Abuse