Saturday, July 3, 2021

Common preprocessing steps and their significance in information retrieval

Common preprocessing steps used in information retrieval tasks, significance of preprocessing in information retrieval, all you need to know about text preprocessing in information retrieval

Question:

What are the common preprocessing steps used in information retrieval tasks?

 

Answer:

 

| Preprocessing technique | How? | Benefits |
|---|---|---|
| Extract root words | Stemming (rule-based, dictionary-based, corpus-based); lemmatization | 1. Improves recall; 2. Reduces index size |
| Stop word removal | Filter tokens against a stop word list | 1. Improves retrieval efficiency; 2. Reduces index size |
| Tokenization (break sentences into tokens/keywords) | Typically, split a sentence at non-letter characters, mostly white space | Tokens are indexed for further processing |
| Normalization | Case folding (convert all text to lower case); spelling variations (map variants to one common spelling); diacritics/accent marks on letters (naïve to naive) | Reduces variation: different surface forms of the same word map to a single index term |
| Detecting common phrases | Index meaningful phrases as single units | More effective retrieval, because phrases are not broken up into a bag of words |
| Building index | Add preprocessed terms to an inverted index (for each term, it stores the list of documents in which the term appears) | Acts as a lookup table to quickly find all documents containing a word |
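
The steps listed above can be chained into one small pipeline. The following is a minimal Python sketch, not a production implementation: the stop word list, the phrase list, the function names, and the sample documents are all made up for illustration, and the `stem` function is a crude suffix stripper standing in for a real rule-based stemmer such as Porter's.

```python
import re
import unicodedata
from collections import defaultdict

# Tiny illustrative lists (assumptions); real systems use much larger ones.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}
PHRASES = {("information", "retrieval")}


def normalize(text):
    """Case folding plus diacritic removal (e.g. 'naïve' becomes 'naive')."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Split already-normalized text at runs of non-letter characters."""
    return [t for t in re.split(r"[^a-z]+", text) if t]


def merge_phrases(tokens):
    """Join known two-word phrases into a single index term."""
    merged, i = [], 0
    while i < len(tokens):
        if tuple(tokens[i:i + 2]) in PHRASES:
            merged.append("_".join(tokens[i:i + 2]))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def stem(token):
    """Crude rule-based suffix stripping (a stand-in, not the Porter algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Normalize, tokenize, detect phrases, drop stop words, then stem."""
    tokens = merge_phrases(tokenize(normalize(text)))
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Merged phrases are kept as-is; single words are stemmed.
    return [t if "_" in t else stem(t) for t in tokens]


def build_index(docs):
    """Build an inverted index: term -> set of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return index


docs = {
    1: "Indexing the documents",
    2: "A naïve index of documents",
    3: "Information retrieval is indexed search",
}
index = build_index(docs)
print(sorted(index["index"]))  # documents whose text contains a form of "index"
```

In this sketch "Indexing", "index" and "indexed" all end up under the same term, which is the recall benefit of stemming, while "the", "a", "of" and "is" never enter the index. A query would need to go through the same `preprocess` function before the lookup so that its terms match the indexed ones.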

 

 

 

