Text Normalization MCQs (Sentence Tokenization, NSW Handling, Homograph Disambiguation)


Introduction:
Text normalization is one of the most essential preprocessing steps in Natural Language Processing (NLP). It ensures that raw, inconsistent, and noisy text is converted into a clean and uniform format that machines can understand. Key tasks such as sentence tokenization, normalizing non-standard words (NSWs), and homograph disambiguation play a crucial role in improving model accuracy and linguistic consistency. In this post, we present 10 hot and fresh MCQs that test your understanding of these core normalization concepts—perfect for interviews, exams, and quick revision.

[Image: text normalization example]


1. Which of the following is the biggest challenge in sentence tokenization for informal text (e.g., chats, tweets)?

A. Presence of uppercase words
B. Multiple periods used for emphasis (“Wait....what?”)
C. Lack of nouns
D. Sentence length variation


Answer: B
Explanation:

Social media uses repeated punctuation, which can mislead boundary detection.

What is Sentence Tokenization?

Sentence tokenization is the process of dividing a continuous block of text into meaningful sentence units so that each sentence can be analyzed separately in NLP tasks.

Example:

Input text: I bought apples, oranges, and bananas. Then I went to the park! It was sunny outside, wasn't it? "Let's play," said my friend.

Sentence tokenized input text:

  • I bought apples, oranges, and bananas.
  • Then I went to the park!
  • It was sunny outside, wasn't it?
  • "Let's play," said my friend.
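A tokenizer that produces the split above can be sketched with a single regular expression. This is a minimal heuristic for illustration only; production libraries such as NLTK's `sent_tokenize` handle many more edge cases:

```python
import re

def sentence_tokenize(text):
    # Split after ., !, or ? when followed by whitespace and either a
    # capital letter or an opening quote -- a minimal heuristic only.
    return re.split(r'(?<=[.!?])\s+(?=[A-Z"])', text)

text = ('I bought apples, oranges, and bananas. Then I went to the park! '
        'It was sunny outside, wasn\'t it? "Let\'s play," said my friend.')
sentences = sentence_tokenize(text)
# yields the four sentences listed above
```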

What is informal text?

Informal text refers to language that deviates from standard grammatical and stylistic conventions. It includes colloquial expressions, slang, abbreviations, misspellings, emoticons, and non-standard syntax typically found in social media, chat messages, and casual communication.

Example 1: Hey! r u coming 2 the party 2nite?

Example 2: idk if I can do this lol… anyone else tried it?

2. Normalizing non-standard words (NSWs) such as “gr8”, “u”, “b4” is essential mainly because:

A. Tokenizers cannot read numbers
B. NSWs break the morphological structure expected by NLP models
C. These words reduce training speed
D. NSWs are not allowed in transformer-based models


Answer: B
Explanation:

Models expect grammatically structured tokens; NSWs distort linguistic patterns.

What are non-standard words (NSW)?

Non-standard words (NSWs) are words in text that deviate from the formal, dictionary-defined words of a language. They often appear in informal text, social media, chat messages, or user-generated content. In NLP, NSWs need to be normalized so that models can process them correctly.

Why normalize NSWs?

NSWs can confuse tokenizers, parsers, and embeddings if not normalized. Normalizing them improves text understanding, sentiment analysis, machine translation, and speech recognition.

Examples: the standard forms of the non-standard words "thx", "gr8", and "u" are "thanks", "great", and "you", respectively.
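The simplest form of NSW normalization is a dictionary lookup, sketched below. The mapping is a tiny illustrative sample, not a complete lexicon:

```python
# Tiny illustrative NSW-to-standard mapping (not a complete lexicon)
NSW_MAP = {"thx": "thanks", "gr8": "great", "u": "you",
           "b4": "before", "pls": "please"}

def normalize_nsw(text):
    # Replace each token found in the map; leave other tokens untouched
    return " ".join(NSW_MAP.get(tok.lower(), tok) for tok in text.split())

print(normalize_nsw("thx u r gr8"))  # -> "thanks you r great"
```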


3. Homograph disambiguation is primarily required because:

A. Homographs always differ in spelling
B. Homographs occur only in informal text
C. The same surface form can represent multiple meanings or pronunciations
D. Tokenizers remove ambiguity automatically


Answer: C
Explanation:

Homographs share the same spelling but differ in meaning or pronunciation (e.g., "lead" the metal vs. "lead" the verb), so the surface form alone is not enough; surrounding context is needed to pick the intended sense.

4. Which of the following is hardest to normalize using simple rule-based NSW expansion rules?

A. “2day” → “today”
B. “u” → “you”
C. “lol” → “laughing out loud”
D. “idk” → “I don’t know”


Answer: A
Explanation:

Alphanumeric blends (digit + letters) require morphological + contextual reasoning.


5. In sentence tokenization, why are abbreviations like “Dr.”, “Inc.”, “St.” a problem?

A. They always occur at the end of a paragraph
B. They are never followed by capital letters
C. They require POS tagging
D. They contain periods that are not sentence boundaries


Answer: D
Explanation:

Periods inside abbreviations often confuse rule-based segmenters.
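One common workaround is to shield known abbreviations before splitting, then restore them afterward. This is a rule-based sketch with a hand-picked abbreviation list, not an exhaustive one:

```python
import re

ABBREVS = ["Dr.", "Mr.", "Mrs.", "Inc.", "St."]  # hand-picked sample list

def split_sentences(text):
    # Temporarily mask the period in each abbreviation so it is not
    # mistaken for a sentence boundary, then split and restore.
    for abbr in ABBREVS:
        text = text.replace(abbr, abbr.replace(".", "<DOT>"))
    parts = re.split(r'(?<=[.!?])\s+', text)
    return [p.replace("<DOT>", ".") for p in parts]

split_sentences("Dr. Smith lives on Elm St. near the park. He works at Acme Inc.")
# -> ['Dr. Smith lives on Elm St. near the park.', 'He works at Acme Inc.']
```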


6. During NSW normalization, converting “5km” into “5 kilometers” is an example of:

A. Unit expansion
B. Lemmatization
C. Semantic chunking
D. Syntactic pruning


Answer: A
Explanation:

Expanding unit abbreviations is standard NSW normalization.
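Unit expansion is often implemented as a regex substitution over a small unit table. The table below is an illustrative sample; a real system would cover many more units and plural/singular agreement:

```python
import re

# Illustrative sample of unit abbreviations and their expansions
UNITS = {"km": "kilometers", "kg": "kilograms", "cm": "centimeters"}

def expand_units(text):
    # Match a number glued to (or spaced from) a known unit abbreviation
    pattern = r"(\d+)\s*(" + "|".join(UNITS) + r")\b"
    return re.sub(pattern, lambda m: f"{m.group(1)} {UNITS[m.group(2)]}", text)

print(expand_units("She ran 5km carrying a 2kg bag"))
# -> "She ran 5 kilometers carrying a 2 kilograms bag"
```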

 

7. Which feature is most useful for homograph disambiguation in context?

A. Number of characters
B. Frequency of the word in corpus
C. POS tags of surrounding words
D. Length of the sentence


Answer: C
Explanation:

Contextual syntactic roles strongly indicate intended meaning (“lead pipe” vs “lead the team”).
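The idea can be illustrated with a deliberately tiny heuristic for the "lead" example. This is a toy sketch, not a real disambiguator; real systems use a full POS tagger over the whole sentence:

```python
# Toy heuristic: if "lead" directly follows a determiner, it is likely
# the noun (the metal, pronounced /led/); otherwise treat it as the
# verb (pronounced /li:d/). Real systems use full POS tagging.
DETERMINERS = {"a", "an", "the", "this", "that"}

def lead_sense(tokens, i):
    if i > 0 and tokens[i - 1].lower() in DETERMINERS:
        return "noun"   # e.g. "a lead pipe"
    return "verb"       # e.g. "lead the team"

lead_sense("They used a lead pipe".split(), 3)  # -> "noun"
lead_sense("I will lead the team".split(), 2)   # -> "verb"
```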


8. Normalizing numbers such as "twenty-four", "24", and "twenty four" into a common form is important because:

A. It improves semantic equivalence in downstream tasks
B. It removes punctuation automatically
C. It increases dataset size
D. It is required by all tokenizers


Answer: A
Explanation:

Different surface forms represent the same concept; normalization avoids semantic inconsistency.
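Mapping all three surface forms to one canonical value can be sketched with a small word-value table. The table here is intentionally minimal (roughly 0-99 style values) for illustration:

```python
# Minimal word-number map for illustration (covers small values only)
WORD_VALUES = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
               "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
               "ten": 10, "twenty": 20, "thirty": 30, "forty": 40}

def normalize_number(s):
    s = s.lower().strip().replace("-", " ")
    if s.isdigit():
        return int(s)
    # Sum the value of each word: "twenty four" -> 20 + 4 = 24
    return sum(WORD_VALUES[w] for w in s.split())

# All three surface forms map to the same canonical value:
[normalize_number(x) for x in ("twenty-four", "24", "twenty four")]  # -> [24, 24, 24]
```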


9. In sentence tokenization, which punctuation mark most frequently causes false boundaries?

A. Semicolon
B. Exclamation mark
C. Period
D. Question mark


Answer: C
Explanation:

Periods appear in decimals, abbreviations, URLs, titles—causing boundary ambiguity.
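The ambiguity is easy to demonstrate: splitting on every period fragments decimals and URLs, while requiring whitespace plus a capital letter after the period avoids those false boundaries (a minimal sketch; it would still miss other cases):

```python
import re

text = "Pi is about 3.14. See example.com for details."

# Naive: treat every period as a boundary -- fragments the decimal and URL
naive = text.split(".")

# Guarded: require whitespace plus a capital letter after the terminator
better = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)

print(len(naive), len(better))  # -> 5 2
```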


10. Which non-standard word requires contextual rather than direct dictionary-based normalization?

A. “bk”
B. “pls”
C. “rt”
D. “mtg”


Answer: C
Explanation:

“rt” may mean “retweet”, “right”, or “route” depending on domain—needs contextual normalization.
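A context-sensitive expansion can be sketched by scanning a window of surrounding tokens for domain cues. The cue words below are illustrative guesses, not a tested lexicon:

```python
def expand_rt(tokens, i):
    # Look at up to three tokens on each side of "rt" for domain cues
    window = {t.lower() for t in tokens[max(0, i - 3): i + 4]}
    if window & {"tweet", "follow", "hashtag"}:
        return "retweet"
    if window & {"turn", "left", "lane"}:
        return "right"
    if window & {"bus", "highway", "road"}:
        return "route"
    return "rt"  # leave unchanged when the context is inconclusive

expand_rt("pls rt this tweet".split(), 1)       # -> "retweet"
expand_rt("take bus rt 9 downtown".split(), 2)  # -> "route"
```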