Text Normalization MCQs (Sentence Tokenization, NSW Handling, Homograph Disambiguation)
Which feature of social-media text can mislead sentence boundary detection?
A. Presence of uppercase words
B. Multiple periods used for emphasis (“Wait....what?”)
C. Lack of nouns
D. Sentence length variation
Answer: B
Explanation:
Social media uses repeated punctuation, which can mislead boundary detection.
What is Sentence Tokenization?
Sentence tokenization is the process of dividing a continuous block of text into meaningful sentence units so that each sentence can be analyzed separately in NLP tasks.
Example:
Input text: I bought apples, oranges, and bananas. Then I went to the park! It was sunny outside, wasn't it? "Let's play," said my friend.
Sentence tokenized input text:
- I bought apples, oranges, and bananas.
- Then I went to the park!
- It was sunny outside, wasn't it?
- "Let's play," said my friend.
What is informal text?
Informal text refers to language that deviates from standard grammatical and stylistic conventions. It includes colloquial expressions, slang, abbreviations, misspellings, emoticons, and non-standard syntax typically found in social media, chat messages, and casual communication.
Example 1: Hey! r u coming 2 the party 2nite?
Example 2: idk if I can do this lol… anyone else tried it?
Why do non-standard words (NSWs) pose a problem for NLP models?
A. Tokenizers cannot read numbers
B. NSWs break the morphological structure expected by NLP models
C. These words reduce training speed
D. NSWs are not allowed in transformer-based models
Answer: B
Explanation:
Models expect grammatically structured tokens; NSWs distort linguistic patterns.
What are non-standard words (NSW)?
Non-standard words (NSWs) are words in text that deviate from the formal, dictionary-defined words of a language. They often appear in informal text, social media, chat messages, or user-generated content. In NLP, NSWs need to be normalized so that models can process them correctly.
Why normalize NSWs?
NSWs can confuse tokenizers, parsers, and embeddings if not normalized. Normalizing them improves text understanding, sentiment analysis, machine translation, and speech recognition.
Examples: the standard forms of the non-standard words "thx", "gr8", and "u" are "thanks", "great", and "you", respectively.
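A minimal normalizer for examples like these is a dictionary lookup over tokens. The mapping below is a tiny illustrative sample, not a standard resource:

```python
# Illustrative lookup table; real systems use much larger
# dictionaries plus context-aware rules.
NSW_LOOKUP = {"thx": "thanks", "gr8": "great", "u": "you", "r": "are"}

def normalize_tokens(tokens):
    # Replace each token with its standard form if one is known;
    # otherwise keep the token unchanged.
    return [NSW_LOOKUP.get(tok.lower(), tok) for tok in tokens]

print(" ".join(normalize_tokens("thx u r gr8".split())))
```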
Why is homograph disambiguation necessary?
A. Homographs always differ in spelling
B. Homographs occur only in informal text
C. The same surface form can represent multiple meanings or pronunciations
D. Tokenizers remove ambiguity automatically
Answer: C
Explanation:
Words like “lead”, “tear”, “bass” need context to resolve meaning/pronunciation.
What is a homograph?
Homographs are words that have the same spelling but differ in meaning and/or pronunciation. This creates lexical ambiguity that requires disambiguation, the process of determining which meaning or pronunciation is intended in a given context.
Example:
Homograph disambiguation is necessary because the same orthographic form (spelling) can represent multiple distinct lexical items. Consider these examples:
"lead" can mean: To guide someone (verb, one pronunciation), A metal element (noun, another pronunciation)
"read" can mean: Present tense: to look at and comprehend written text (verb, one pronunciation), Past tense: looked at and comprehended written text (verb, another pronunciation)
"bank" can mean: A financial institution (noun), The side of a river (noun), To lean laterally (verb)
Which of these NSW normalizations involves an alphanumeric blend (digits mixed with letters)?
A. “2day” → “today”
B. “u” → “you”
C. “lol” → “laughing out loud”
D. “idk” → “I don’t know”
Answer: A
Explanation:
Alphanumeric blends (digit + letters) require morphological + contextual reasoning.
What does rule-based NSW normalization do?
A rule-based NSW system uses predefined rules, patterns, and dictionaries to detect and normalize non-standard words such as abbreviations, slang, numbers, dates, emojis, contractions, and phonetic spellings.
It does not learn from data. Instead, it follows explicit rules created by humans.
- Lookup dictionaries, e.g., "gr8" → "great"
- Regular expression rules, e.g., "soooo" → "so"
- Morphological or phonetic rules, e.g., "wanna" → "want to"
- Spelling correction rules, e.g., "definately" → "definitely"
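These rule types can be combined into a small pipeline. The sketch below uses a tiny illustrative dictionary and one regex rule; a real system would have far more rules and would order them carefully:

```python
import re

# Tiny sample covering slang, phonetic spellings, and common misspellings.
LOOKUP = {"gr8": "great", "wanna": "want to", "definately": "definitely"}

def rule_based_normalize(text):
    out = []
    for tok in text.split():
        # Regex rule: collapse 3+ repeated letters ("soooo" -> "so").
        tok = re.sub(r'([a-z])\1{2,}', r'\1', tok, flags=re.IGNORECASE)
        # Dictionary lookup for anything the regex did not fix.
        out.append(LOOKUP.get(tok.lower(), tok))
    return " ".join(out)

print(rule_based_normalize("I definately wanna be soooo gr8"))
```

Note the ordering choice: the repeated-letter rule runs before the lookup so that "soooo" is first reduced to "so" and only then checked against the dictionary.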
Why do abbreviations pose a challenge for sentence tokenization?
A. They always occur at the end of a paragraph
B. They are never followed by capital letters
C. They require POS tagging
D. They contain periods that are not sentence boundaries
Answer: D
Explanation:
Periods inside abbreviations often confuse rule-based segmenters.
Converting "5kg" to "five kilograms" is an example of which normalization step?
A. Unit expansion
B. Lemmatization
C. Semantic chunking
D. Syntactic pruning
Answer: A
Explanation:
Expanding unit abbreviations is standard NSW normalization.
What is unit expansion and why is it needed?
Unit expansion is the process of converting measurement units like kg, cm, km/h, $, %, °C into their expanded, readable forms.
Example:
- 5kg → "five kilograms"
- 12°C → "twelve degrees Celsius"
This helps systems interpret the meaning correctly, especially in speech or language models.
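A sketch of such an expander follows. The tiny `NUMBER_WORDS` table stands in for a full number verbalizer (e.g., the num2words library), and only the units listed are handled; both tables are assumptions for illustration:

```python
import re

UNIT_NAMES = {"kg": "kilograms", "°C": "degrees Celsius",
              "km/h": "kilometers per hour", "%": "percent"}
# Stand-in for a full number verbalizer; real systems spell out any number.
NUMBER_WORDS = {"5": "five", "12": "twelve"}

def expand_units(text):
    # Match longest units first so "km/h" is tried before shorter units.
    unit_pattern = "|".join(re.escape(u)
                            for u in sorted(UNIT_NAMES, key=len, reverse=True))
    def repl(match):
        num, unit = match.group(1), match.group(2)
        return f"{NUMBER_WORDS.get(num, num)} {UNIT_NAMES[unit]}"
    return re.sub(rf"(\d+)\s*({unit_pattern})", repl, text)

print(expand_units("Carry 5kg and note it is 12°C outside."))
```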
Which feature of the context is most useful for homograph disambiguation?
A. Number of characters
B. Frequency of the word in corpus
C. POS tags of surrounding words
D. Length of the sentence
Answer: C
Explanation:
Contextual syntactic roles strongly indicate intended meaning (“lead pipe” vs “lead the team”).
How do surrounding words help in homograph disambiguation?
Homographs like bank, lead, bow, tear, bat, etc., can only be correctly interpreted when we examine the words before and after them.
Surrounding words provide three kinds of cues:
- Syntactic cues: the parts of speech of neighboring words help identify the target word's POS via the language's grammar.
- Semantic cues: the meanings of neighboring words support one sense of the target word over the others.
- Topic cues: the topic or domain of the sentence or passage narrows down the intended sense.
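One simple way to exploit semantic and topic cues is keyword overlap against hand-written sense inventories. This is a toy sketch with an invented cue list; real systems use POS taggers, sense embeddings, or supervised classifiers:

```python
# Toy sense inventory: each sense of "bank" lists context keywords.
SENSE_CUES = {
    "financial institution": {"money", "loan", "deposit", "account", "cash"},
    "river side": {"river", "water", "fishing", "shore", "muddy"},
}

def disambiguate_bank(sentence):
    # Score each sense by how many of its cue words appear in the sentence.
    context = set(sentence.lower().replace(".", "").split())
    return max(SENSE_CUES, key=lambda sense: len(SENSE_CUES[sense] & context))

print(disambiguate_bank("She opened a savings account at the bank."))
print(disambiguate_bank("We sat on the muddy bank of the river."))
```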
Why is text normalization useful for downstream NLP tasks?
A. It improves semantic equivalence in downstream tasks
B. It removes punctuation automatically
C. It increases dataset size
D. It is required by all tokenizers
Answer: A
Explanation:
Different surface forms represent the same concept; normalization avoids semantic inconsistency.
Why is normalization needed in general?
Text normalization directly improves NLP model accuracy by standardizing noisy, inconsistent, or informal input text. This reduces vocabulary size, removes ambiguity, enhances context interpretation, and boosts the performance of downstream NLP tasks.
Which punctuation mark is the most ambiguous for sentence boundary detection?
A. Semicolon
B. Exclamation mark
C. Period
D. Question mark
Answer: C
Explanation:
Periods appear in decimals, abbreviations, URLs, titles—causing boundary ambiguity.
Which of these abbreviations is the most context-dependent to normalize?
A. “bk”
B. “pls”
C. “rt”
D. “mtg”
Answer: C
Explanation:
“rt” may mean “retweet”, “right”, or “route” depending on the domain, so it needs contextual normalization. The other options are unambiguous: “bk” stands for “book”, “pls” for “please”, and “mtg” for “meeting” in almost all contexts.
