
Sunday, February 28, 2021

What is smoothing in NLP and why do we need it

What is smoothing in the context of natural language processing? Define smoothing in NLP. What is the purpose of smoothing in NLP? Is smoothing an important task in language modelling?

Smoothing in NLP

Smoothing is the process of flattening a probability distribution implied by a language model so that all reasonable word sequences can occur with some probability. This often involves broadening the distribution by redistributing weight from high probability regions to zero probability regions.

Smoothing not only prevents zero probabilities, but also attempts to improve the accuracy of the model as a whole.

Why do we need smoothing?

In a language model, we estimate the parameters from training data by maximum likelihood estimation (MLE). We cannot reliably evaluate such MLE models on unseen test data, because the test data is likely to contain words/n-grams to which these models assign zero probability. Relative frequency estimation assigns all probability mass to events seen in the training corpus, but we need to reserve some probability mass for events that do not occur in the training data (unseen events).

Example:

Training data: The cow is an animal.

Test data: The dog is an animal.

If we train a unigram model on this data:

P(the) = count(the) / (total number of words in the training set) = 1/5

Likewise, P(cow) = P(is) = P(an) = P(animal) = 1/5

Evaluating the unigram model on the training sentence:

P(the cow is an animal) = P(the) * P(cow) * P(is) * P(an) * P(animal) = (1/5)^5 = 0.00032

 

When we apply the unigram model to the test sentence, its probability becomes zero because P(dog) = 0: the word 'dog' never occurred in the training data. Hence, we use smoothing.
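The post does not prescribe a particular smoothing technique, so as one simple illustration, here is a minimal Python sketch that contrasts the unsmoothed (MLE) unigram model with add-one (Laplace) smoothing on the toy corpus above. The extra <unk> vocabulary slot for unseen words is an assumption of this sketch.

from collections import Counter

# Toy corpus from the example above
train_tokens = "the cow is an animal".split()
test_tokens = "the dog is an animal".split()

counts = Counter(train_tokens)
N = len(train_tokens)        # total tokens in the training data (5)
V = len(counts) + 1          # word types plus one <unk> slot for unseen words (assumption)

def p_mle(word):
    # Unsmoothed relative-frequency (MLE) estimate
    return counts[word] / N

def p_add_one(word):
    # Add-one (Laplace) smoothing: reserves probability mass for unseen words
    return (counts[word] + 1) / (N + V)

def sentence_prob(tokens, p):
    prob = 1.0
    for w in tokens:
        prob *= p(w)
    return prob

print(sentence_prob(test_tokens, p_mle))      # 0.0 -- P(dog) = 0 makes the whole product zero
print(sentence_prob(test_tokens, p_add_one))  # small but non-zero probability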

 

****************

Explain the concept of smoothing in NLP

Why do we need smoothing

What is the advantage of smoothing the data in language models



Friday, December 18, 2020

What is lemmatization in natural language processing

What is lemmatization in NLP? Define lemmatization, Lemmatization example

Lemmatization

In a language, a word is usually inflected to form new words, especially to mark distinctions such as tense, person, number, gender, mood, voice, and case. In linguistics, lemmatization is the process of removing those inflections from a word in order to identify the lemma (the dictionary form of the word). A dictionary word (lemma / root word) is inflected into various words with the same base meaning or related meanings by adding one or more morphemes (both free and bound). Through lemmatization, we remove the bound morphemes.

Lemmatization refers to doing this algorithmically, with the use of a vocabulary and morphological analysis of words, aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

Inflected word → Removal of morphemes → Lemma

Example:

Inflected word | Morpheme removed | Lemma
Runs           | 's'              | Run
Studies        | 'ies'            | Study
Opened         | 'ed'             | Open
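As a concrete illustration, here is a minimal sketch using NLTK's WordNetLemmatizer (an assumption of this sketch: the nltk package is installed and the WordNet data has been downloaded; the post itself does not prescribe any particular tool):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')            # one-time download of the WordNet data

lemmatizer = WordNetLemmatizer()

# The part-of-speech tag matters: here the words are treated as verbs
print(lemmatizer.lemmatize('runs', pos='v'))     # run
print(lemmatizer.lemmatize('studies', pos='v'))  # study
print(lemmatizer.lemmatize('opened', pos='v'))   # open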

 

******************


 

Define lemmatization

What is lemmatization

What is lemma

Friday, June 12, 2020

Natural Language Processing MCQ 12

MCQ in Natural Language Processing, Quiz questions with answers in NLP, Top interview questions in NLP with answers, language model quiz questions, MLE in NLP


Multiple Choice Questions and Answers in NLP Set - 12


1. Assume a corpus with 350 tokens in it. We have 20 word types in that corpus (V = 20). The frequencies (unigram counts) of the word types “short” and “fork” are 25 and 15 respectively. If we are using Laplace smoothing, which of the following is PLaplace(“fork”)?

(a) 15/350

(b) 16/370

(c) 30/350

(d) 31/370



Answer: (b) 16/370

In Laplace smoothing (also called add-1 smoothing), we compute the probability by adding 1 to the numerator and V to the denominator. This ensures that the count of every word type in the vocabulary is incremented by 1.

P(w) = [count(w) + 1] / [count(tokens) + V] = (15 + 1) / (350 + 20) = 16/370
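A quick numeric check of this answer (the helper function below is hypothetical, introduced only for illustration):

from fractions import Fraction

def p_laplace(count_w, n_tokens, vocab_size):
    # Add-1 (Laplace) smoothed unigram probability
    return Fraction(count_w + 1, n_tokens + vocab_size)

print(p_laplace(15, 350, 20))         # 8/185, i.e. 16/370
print(float(p_laplace(15, 350, 20)))  # ~0.0432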

 

2. When training a language model, if we use an overly narrow corpus, the probabilities

(a) Don’t reflect the task

(b) Reflect all possible wordings

(c) Reflect intuition

(d) Don’t generalize



Answer: (d) Don’t generalize

Because the output of a language model depends on its training corpus, N-grams only work well for word prediction if the test corpus looks like the training corpus. Hence, if the training corpus is overly narrow, the probabilities don’t generalize.

 

3. The difference(s) between generative models and discriminative models include(s)

(a) Discriminative models capture the joint distribution between features and class labels

(b) Generative models assume conditional independence among features

(c) Generative models can effectively explore unlabeled data

(d) Discriminative models provide more flexibility in introducing features.



Answer: (c) and (d)

Generative models can effectively explore unlabeled data, and discriminative models provide more flexibility in introducing features. Option (a) is wrong because it is generative models, not discriminative ones, that capture the joint distribution between features and class labels; option (b) describes the conditional independence (naive Bayes) assumption, which is made by one particular generative model rather than by generative models in general.

 

4. Assume that there are 10000 documents in a collection. Out of these, 50 documents contain the term “difficult task”. If “difficult task” appears 3 times in a particular document, what is the TFIDF value of the term for that document?

(a) 8.11

(b) 15.87

(c) 0

(d) 81.1



Answer: (b) 15.87

IDF = log(total no. of docs / no. of docs containing the term) = log(10000/50) = log(200) ≈ 5.29, using the natural logarithm.

TFIDF = term’s frequency in the doc * IDF = 3 * 5.29 = 15.87
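A quick check of this arithmetic (raw term frequency multiplied by a natural-log IDF, matching the working above; the variable names are just for illustration):

import math

n_docs = 10000   # documents in the collection
df = 50          # documents containing "difficult task"
tf = 3           # occurrences of the term in this document

idf = math.log(n_docs / df)   # natural log of 200, ~5.298
print(round(tf * idf, 2))     # 15.89; the option 15.87 comes from rounding IDF to 5.29 first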

 

5. Let us suppose that you have the following two 4-dimensional word vectors for two words w1 and w2 respectively:

w1 = (0.2, 0.1, 0.3, 0.4) and w2 = (0.3, 0, 0.2, 0.5)

What is the cosine similarity between w1 and w2?

(a) 0.948

(b) 0.832

(c) 0

(d) 0.5



Answer: (a) 0.948

Cosine similarity can be calculated as follows;

cosine(w1, w2) = (w1 . w2) / (|w1| * |w2|), where |w| = sqrt(w_1^2 + w_2^2 + ... + w_n^2)

For the given problem, n = 4. w1 . w2 is the dot product, which can be expanded for our data as follows:

w1 . w2 = (0.2 * 0.3) + (0.1 * 0) + (0.3 * 0.2) + (0.4 * 0.5) = 0.32

|w1| = sqrt(0.2^2 + 0.1^2 + 0.3^2 + 0.4^2) = sqrt(0.30) ≈ 0.548

|w2| = sqrt(0.3^2 + 0^2 + 0.2^2 + 0.5^2) = sqrt(0.38) ≈ 0.616

cosine(w1, w2) = 0.32 / (0.548 * 0.616) ≈ 0.948
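The same result can be verified with a short plain-Python sketch (no external libraries assumed):

import math

w1 = (0.2, 0.1, 0.3, 0.4)
w2 = (0.3, 0.0, 0.2, 0.5)

dot = sum(a * b for a, b in zip(w1, w2))      # 0.32
norm1 = math.sqrt(sum(a * a for a in w1))     # ~0.548
norm2 = math.sqrt(sum(b * b for b in w2))     # ~0.616

print(round(dot / (norm1 * norm2), 3))        # 0.948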

 

*************


Top interview questions in NLP

NLP quiz questions with answers explained

Bigram and trigram language models

Online NLP quiz with solutions

how to find similarity between two or more documents

MCQ important questions and answers in natural language processing

important quiz questions in nlp for placement

Cosine similarity between documents
