Natural Language Processing MCQ - Bigram probability calculation using MLE
1. To compute the bigram probability P(w_{n}|w_{n-1}) using Maximum Likelihood Estimation (MLE), we count the occurrences of the bigram (w_{n-1}w_{n}) in a corpus and normalize by the count of all bigrams that start with w_{n-1}. This normalization step ensures that the estimate lies between 0 and 1.
P(w_{n}|w_{n-1}) = Count (w_{n-1}w_{n}) / Sum(Count(w_{n-1}w))
Here, w is any word that follows w_{n-1}.
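As a rough sketch of the formula above, the following Python snippet counts the bigrams in a tokenized corpus and normalizes by the sum of the counts of all bigrams that share the same first word. The function name bigram_mle and the toy token list are illustrative assumptions, not part of the original question.

```python
from collections import Counter

def bigram_mle(tokens, prev_word, word):
    """MLE estimate of P(word | prev_word) from a list of tokens (hypothetical helper)."""
    # Count every adjacent pair (w_{n-1}, w_n) in the corpus.
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # Denominator: sum of counts of all bigrams that start with prev_word.
    total = sum(c for (first, _), c in bigram_counts.items() if first == prev_word)
    if total == 0:
        # prev_word never starts a bigram, so the estimate is undefined;
        # returning 0.0 is a convention chosen for this sketch only.
        return 0.0
    return bigram_counts[(prev_word, word)] / total

# Toy token sequence, for illustration only.
tokens = "a b a c a b".split()
print(bigram_mle(tokens, "a", "b"))
```

Here "a" starts three bigrams ("a b" twice and "a c" once), so the estimate for P(b | a) is 2 out of 3.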
This equation can be simplified by replacing the bigram count in the denominator with the unigram count of w_{n-1}. Why do we want to do that?
a) Bigram count can only be normalized by unigram count
b) Sum of all bigram counts that start with the word w_{n-1} is equal to the unigram count of the same word
c) Normalization using the bigram count will make the estimate greater than 1 in some cases.
d) None of the above.
Answer: (b) Sum of all bigram counts that start with the word w_{n-1} is equal to the unigram count of the same word
Let us calculate the bigram probability P(increase | to) using both normalizations: by the sum of bigram counts and by the unigram count. (Note: hereafter I use ‘C’ to refer to ‘Count’.)
Normalizing by sum of all bigram counts
For this case, we need to normalize using the total count of bigrams that start with the word “to”.
P(increase | to) = C(“to increase”)/[C(“to increase”)+C(“to be”)+C(“to fill”)] = 2/[2+1+1] = 2/4 = 0.5
Normalizing by unigram count
For this case, we need to normalize using the unigram count of the same word “to”.
P(increase|to) = C(“to increase”)/C(“to”) = 2/4 = 0.5
The word “to” occurs only 4 times in the corpus, and each occurrence is followed by exactly one word. Hence, the counts of all bigrams that start with “to” sum to exactly 4, which is the unigram count of “to”. For this reason, we can simplify the equation by normalizing with the unigram count instead of the sum of all bigram counts.
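The comparison can be checked with a short Python sketch. The sentences below are hypothetical, chosen only so that the counts match the worked example (C(“to increase”) = 2, C(“to be”) = 1, C(“to fill”) = 1); they are not the corpus from the original question. The end marker "</s>" is also an assumption of this sketch.

```python
from collections import Counter

# Hypothetical sentences chosen only so the counts match the worked example;
# they are NOT the corpus from the original question.
sentences = [
    "we plan to increase prices",
    "they want to increase sales",
    "she hopes to be there",
    "he needs to fill the form",
]

unigrams = Counter()
bigrams = Counter()
for s in sentences:
    tokens = s.split() + ["</s>"]   # end marker so every word has a successor
    unigrams.update(tokens[:-1])    # count real words only, not the marker
    bigrams.update(zip(tokens, tokens[1:]))

# Sum of the counts of all bigrams that start with "to".
bigram_sum = sum(c for (first, _), c in bigrams.items() if first == "to")

p_by_bigram_sum = bigrams[("to", "increase")] / bigram_sum
p_by_unigram = bigrams[("to", "increase")] / unigrams["to"]

print(bigram_sum, unigrams["to"])      # 4 4
print(p_by_bigram_sum, p_by_unigram)   # 0.5 0.5

# The identity holds for every word in this toy corpus, not just "to".
for w in unigrams:
    assert unigrams[w] == sum(c for (first, _), c in bigrams.items() if first == w)
```

Note that the identity relies on every occurrence of a word having a successor. Without an end marker, the last word of the corpus would have a unigram count one higher than the sum of the bigram counts that start with it.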