Text Normalization MCQs (Sentence Tokenization, NSW Handling, Homograph Disambiguation)


Introduction:
Text normalization is one of the most essential preprocessing steps in Natural Language Processing (NLP). It ensures that raw, inconsistent, and noisy text is converted into a clean and uniform format that machines can understand. Key tasks such as sentence tokenization, normalizing non-standard words (NSWs), and homograph disambiguation play a crucial role in improving model accuracy and linguistic consistency. In this post, we present 10 hot and fresh MCQs that test your understanding of these core normalization concepts—perfect for interviews, exams, and quick revision.

[Image: Text normalization example]


1. Which of the following is the biggest challenge in sentence tokenization for informal text (e.g., chats, tweets)?

A. Presence of uppercase words
B. Multiple periods used for emphasis (“Wait....what?”)
C. Lack of nouns
D. Sentence length variation


Answer: B
Explanation:

Social media uses repeated punctuation, which can mislead boundary detection.

What is Sentence Tokenization?

Sentence tokenization is the process of dividing a continuous block of text into meaningful sentence units so that each sentence can be analyzed separately in NLP tasks.

Example:

Input text: I bought apples, oranges, and bananas. Then I went to the park! It was sunny outside, wasn't it? "Let's play," said my friend.

Sentence tokenized input text:

  • I bought apples, oranges, and bananas.
  • Then I went to the park!
  • It was sunny outside, wasn't it?
  • "Let's play," said my friend.

What is informal text?

Informal text refers to language that deviates from standard grammatical and stylistic conventions. It includes colloquial expressions, slang, abbreviations, misspellings, emoticons, and non-standard syntax typically found in social media, chat messages, and casual communication.

Example 1: Hey! r u coming 2 the party 2nite?

Example 2: idk if I can do this lol… anyone else tried it?

2. Normalizing non-standard words (NSWs) such as “gr8”, “u”, “b4” is essential mainly because:

A. Tokenizers cannot read numbers
B. NSWs break the morphological structure expected by NLP models
C. These words reduce training speed
D. NSWs are not allowed in transformer-based models


Answer: B
Explanation:

Models expect grammatically structured tokens; NSWs distort linguistic patterns.

What are non-standard words (NSW)?

Non-standard words (NSWs) are words in text that deviate from the formal, dictionary-defined words of a language. They often appear in informal text, social media, chat messages, or user-generated content. In NLP, NSWs need to be normalized so that models can process them correctly.

Why normalize NSWs?

NSWs can confuse tokenizers, parsers, and embeddings if not normalized. Normalizing them improves text understanding, sentiment analysis, machine translation, and speech recognition.

Examples: the standard forms of the non-standard words "thx", "gr8", and "u" are "thanks", "great", and "you", respectively.
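These mappings can be expressed as a minimal dictionary-based normalizer. The lexicon below is a tiny illustrative sample, not a real NSW dictionary:

```python
# A tiny illustrative NSW lexicon; real systems use far larger dictionaries.
NSW_LEXICON = {
    "thx": "thanks",
    "gr8": "great",
    "u": "you",
    "r": "are",
    "b4": "before",
}

def normalize_nsw(text):
    """Replace each token found in the lexicon with its standard form."""
    return " ".join(NSW_LEXICON.get(tok.lower(), tok) for tok in text.split())

print(normalize_nsw("thx u r gr8"))  # thanks you are great
```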


3. Homograph disambiguation is primarily required because:

A. Homographs always differ in spelling
B. Homographs occur only in informal text
C. The same surface form can represent multiple meanings or pronunciations
D. Tokenizers remove ambiguity automatically


Answer: C
Explanation:

Words like “lead”, “tear”, “bass” need context to resolve meaning/pronunciation.

What is a Homograph?

Homographs are words that have the same spelling but differ in meaning and/or pronunciation. This creates lexical ambiguity that requires disambiguation, the process of determining which meaning or pronunciation is intended in a given context.

Example:

Homograph disambiguation is necessary because the same orthographic form (spelling) can represent multiple distinct lexical items. Consider these examples:

"lead" can mean: To guide someone (verb, one pronunciation), A metal element (noun, another pronunciation)

"read" can mean: Present tense: to look at and comprehend written text (verb, one pronunciation), Past tense: looked at and comprehended written text (verb, another pronunciation)

"bank" can mean: A financial institution (noun), The side of a river (noun), To lean laterally (verb)

4. Which of the following is hardest to normalize using simple rule-based NSW expansion rules?

A. “2day” → “today”
B. “u” → “you”
C. “lol” → “laughing out loud”
D. “idk” → “I don’t know”


Answer: A
Explanation:

Alphanumeric blends (digit + letters) require morphological + contextual reasoning.

What does rule-based NSW normalization do?

A rule-based NSW system uses predefined rules, patterns, and dictionaries to detect and normalize non-standard words such as abbreviations, slang, numbers, dates, emojis, contractions, and phonetic spellings.

It does not learn from data. Instead, it follows explicit rules created by humans.

To map a non-standard word to its standard form, a rule-based NSW system typically uses:

  • Lookup dictionaries - e.g. "gr8" to "great"
  • Regular expression rules - e.g. "soooo" to "so"
  • Morphological or phonetic rules - e.g. "wanna" to "want to"
  • Spelling correction rules - e.g. "definately" to "definitely"
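The rule types above can be combined in a small pipeline. The sketch below uses one illustrative lookup table (covering the dictionary, phonetic, and spelling rules) plus one regex rule for character elongation; a real system would keep these rule sets separate and much larger. Note the sketch lowercases every token:

```python
import re

# Illustrative lookup table covering dictionary, phonetic, and spelling rules.
LOOKUP = {"gr8": "great", "wanna": "want to", "definately": "definitely"}

def normalize_token(tok):
    """Normalize one token via lookup, then via an elongation regex."""
    t = tok.lower()
    if t in LOOKUP:                      # dictionary / phonetic / spelling rule
        return LOOKUP[t]
    # Regex rule: collapse 3+ repeated characters ("soooo" -> "so").
    collapsed = re.sub(r'(.)\1{2,}', r'\1', t)
    return LOOKUP.get(collapsed, collapsed)

print(" ".join(normalize_token(t) for t in "I wanna be soooo gr8".split()))
```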


5. In sentence tokenization, why are abbreviations like “Dr.”, “Inc.”, “St.” a problem?

A. They always occur at the end of a paragraph
B. They are never followed by capital letters
C. They require POS tagging
D. They contain periods that are not sentence boundaries


Answer: D
Explanation:

Periods inside abbreviations often confuse rule-based segmenters.
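One common fix is an abbreviation list that blocks splitting on those periods. A minimal sketch (the abbreviation set is illustrative, and real segmenters also handle decimals, URLs, and initials):

```python
# Illustrative abbreviation list; real segmenters use much larger ones.
ABBREVIATIONS = {"dr.", "inc.", "st.", "mr.", "mrs.", "e.g.", "i.e."}

def split_sentences(text):
    """Split on sentence-final periods, skipping known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith works at Acme Inc. in St. Louis. He is busy."))
```

Without the abbreviation check, this input would be cut into five fragments instead of two sentences.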


6. During NSW normalization, converting “5km” into “5 kilometers” is an example of:

A. Unit expansion
B. Lemmatization
C. Semantic chunking
D. Syntactic pruning


Answer: A
Explanation:

Expanding unit abbreviations is standard NSW normalization.

What is unit expansion and why is it needed?

Unit expansion is the process of converting measurement units like kg, cm, km/h, $, %, °C into their expanded, readable forms.

Example:

  • "5kg" to "five kilograms"

  • "12°C" to "twelve degrees Celsius"

This helps systems interpret the meaning correctly, especially in speech or language models.
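A sketch of regex-based unit expansion. The unit table here is illustrative; a real system would also handle symbols like °C and singular vs. plural agreement:

```python
import re

# Illustrative unit table; extend as needed.
UNIT_NAMES = {"km": "kilometers", "kg": "kilograms", "cm": "centimeters", "%": "percent"}

def expand_units(text):
    """Expand a number glued to (or spaced from) a known unit symbol.
    Ignores singular/plural agreement ("1 kilometers") for simplicity."""
    pattern = re.compile(r'(\d+(?:\.\d+)?)\s*(km|kg|cm|%)')
    return pattern.sub(lambda m: f"{m.group(1)} {UNIT_NAMES[m.group(2)]}", text)

print(expand_units("She ran 5km and lifted 20 kg."))
```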

 

7. Which feature is most useful for homograph disambiguation in context?

A. Number of characters
B. Frequency of the word in corpus
C. POS tags of surrounding words
D. Length of the sentence


Answer: C
Explanation:

Contextual syntactic roles strongly indicate intended meaning (“lead pipe” vs “lead the team”).

How do surrounding words help in homograph disambiguation?

Homographs like bank, lead, bow, tear, bat, etc., can only be correctly interpreted when we examine the words before and after them.

Surrounding words provide:

  • Syntactic cues: the parts of speech of neighboring words, combined with the language's grammar, help identify the correct POS of the target word.
  • Semantic cues: the meanings of neighboring words support choosing one sense of the target word.
  • Topic cues: the topic or domain of the sentence or text helps disambiguate.
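These cues can be approximated even with toy rules. The sketch below disambiguates "lead" from its immediate neighbors; the hand-picked word lists stand in for real POS-tag and topic features:

```python
def disambiguate_lead(tokens, i):
    """Pick a sense for 'lead' at position i from its neighboring words.
    Toy word lists stand in for real POS-tag and topic features."""
    prev = tokens[i - 1].lower() if i > 0 else ""
    nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    if prev in {"to", "will", "can", "must"} or nxt in {"the", "us", "them"}:
        return "verb: to guide"
    if prev in {"the", "a"} or nxt in {"pipe", "paint", "poisoning"}:
        return "noun: the metal"
    return "unknown"

print(disambiguate_lead("She will lead the team".split(), 2))  # verb: to guide
print(disambiguate_lead("a lead pipe burst".split(), 1))       # noun: the metal
```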


8. Normalizing numbers such as "twenty-four", "24", and "twenty four" into a common form is important because:

A. It improves semantic equivalence in downstream tasks
B. It removes punctuation automatically
C. It increases dataset size
D. It is required by all tokenizers


Answer: A
Explanation:

Different surface forms represent the same concept; normalization avoids semantic inconsistency.
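Mapping the three surface forms to one canonical digit string can be sketched as follows. This toy version covers only tens-plus-units combinations (teens and numbers of 100 or more are out of scope):

```python
ONES = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
        "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

def to_canonical(form):
    """Map '24', 'twenty-four', and 'twenty four' to the digit string '24'.
    Teens and numbers >= 100 are out of scope for this sketch."""
    form = form.strip().lower().replace("-", " ")
    if form.isdigit():
        return form
    total = 0
    for part in form.split():
        if part in TENS:
            total += TENS[part]
        elif part in ONES:
            total += ONES[part]
        else:
            return form  # unrecognized word: leave the form unchanged
    return str(total)

for form in ["24", "twenty-four", "twenty four"]:
    print(to_canonical(form))  # 24 each time
```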

Why is normalization needed in general?

Text normalization directly improves NLP model accuracy by standardizing noisy, inconsistent, or informal input text. This reduces vocabulary size, removes ambiguity, enhances context interpretation, and boosts the performance of downstream NLP tasks.


9. In sentence tokenization, which punctuation mark most frequently causes false boundaries?

A. Semicolon
B. Exclamation mark
C. Period
D. Question mark


Answer: C
Explanation:

Periods appear in decimals, abbreviations, URLs, titles—causing boundary ambiguity.


10. Which non-standard word requires contextual rather than direct dictionary-based normalization?

A. “bk”
B. “pls”
C. “rt”
D. “mtg”


Answer: C
Explanation:

“rt” may mean “retweet”, “right”, or “route” depending on the domain, so it needs contextual normalization. The other options have fixed expansions: “bk” stands for “book”, “pls” for “please”, and “mtg” for “meeting” in almost all contexts.