Text Normalization MCQs (Sentence Tokenization, NSW Handling, Homograph Disambiguation)
Q1. Which feature of social media text most often misleads sentence boundary detection?
A. Presence of uppercase words
B. Multiple periods used for emphasis (“Wait....what?”)
C. Lack of nouns
D. Sentence length variation
Answer: B
Explanation:
Social media text uses repeated punctuation for emphasis, which can mislead boundary detection.
What is Sentence Tokenization?
Sentence tokenization is the process of dividing a continuous block of text into meaningful sentence units so that each sentence can be analyzed separately in NLP tasks.
Example:
Input text: I bought apples, oranges, and bananas. Then I went to the park! It was sunny outside, wasn't it? "Let's play," said my friend.
Sentence tokenized input text:
- I bought apples, oranges, and bananas.
- Then I went to the park!
- It was sunny outside, wasn't it?
- "Let's play," said my friend.
What is informal text?
Informal text refers to language that deviates from standard grammatical and stylistic conventions. It includes colloquial expressions, slang, abbreviations, misspellings, emoticons, and non-standard syntax typically found in social media, chat messages, and casual communication.
Example 1: Hey! r u coming 2 the party 2nite?
Example 2: idk if I can do this lol… anyone else tried it?
Q2. Why are non-standard words (NSWs) problematic for NLP models?
A. Tokenizers cannot read numbers
B. NSWs break the morphological structure expected by NLP models
C. These words reduce training speed
D. NSWs are not allowed in transformer-based models
Answer: B
Explanation:
Models expect grammatically structured tokens; NSWs distort the linguistic patterns the models were trained on.
What are non-standard words (NSW)?
Non-standard words (NSWs) are words in text that deviate from the formal, dictionary-defined words of a language. They often appear in informal text, social media, chat messages, or user-generated content. In NLP, NSWs need to be normalized so that models can process them correctly.
Why normalize NSWs?
NSWs can confuse tokenizers, parsers, and embeddings if not normalized. Normalizing them improves text understanding, sentiment analysis, machine translation, and speech recognition.
Examples: the standard forms of the non-standard words "thx", "gr8", and "u" are "thanks", "great", and "you", respectively.
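A plain dictionary lookup is the simplest normalization strategy for such NSWs. The sketch below is illustrative only: the NSW_MAP table and normalize_nsw helper are made-up names, and a real system would pair a far larger lexicon with context-aware rules.

```python
import re

# Illustrative NSW-to-standard-form table (hypothetical, not a library API).
NSW_MAP = {"thx": "thanks", "gr8": "great", "u": "you",
           "2nite": "tonight", "idk": "I don't know"}

def normalize_nsw(text: str) -> str:
    # Split into word and non-word runs so punctuation survives unchanged,
    # then replace each word that has a known standard form.
    tokens = re.findall(r"\w+|\W+", text)
    return "".join(NSW_MAP.get(tok.lower(), tok) for tok in tokens)

print(normalize_nsw("thx, u r gr8"))  # -> thanks, you r great
```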
Q3. Which statement about homographs is correct?
A. Homographs always differ in spelling
B. Homographs occur only in informal text
C. The same surface form can represent multiple meanings or pronunciations
D. Tokenizers remove ambiguity automatically
Answer: C
Explanation:
A homograph is a single written form with several possible meanings or pronunciations (e.g., "lead" the metal vs. "lead" the verb), so context is needed to resolve it.
Q4. Which of these NSW normalizations involves an alphanumeric blend, making it harder than a simple abbreviation lookup?
A. “2day” → “today”
B. “u” → “you”
C. “lol” → “laughing out loud”
D. “idk” → “I don’t know”
Answer: A
Explanation:
Alphanumeric blends (digits mixed with letters) require morphological and contextual reasoning, not just a dictionary lookup.
Q5. Why do abbreviations such as “Dr.” complicate sentence tokenization?
A. They always occur at the end of a paragraph
B. They are never followed by capital letters
C. They require POS tagging
D. They contain periods that are not sentence boundaries
Answer: D
Explanation:
The periods inside abbreviations often trick rule-based segmenters into splitting too early.
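A quick sketch of that failure mode, on made-up example text: a naive rule that splits after every period plus whitespace breaks on "Dr." and "Mrs.", while NLTK's Punkt tokenizer, which has learned common abbreviations, typically gets it right.

```python
import re
import nltk
nltk.download("punkt", quiet=True)
from nltk.tokenize import sent_tokenize

text = "Dr. Smith paid $3.50 for the apples. He thanked Mrs. Jones."

# Naive rule: every period followed by whitespace ends a sentence.
print(re.split(r"(?<=\.)\s+", text))
# -> ['Dr.', 'Smith paid $3.50 for the apples.', 'He thanked Mrs.', 'Jones.']

# Punkt knows common abbreviations and typically splits correctly:
print(sent_tokenize(text))
# -> ['Dr. Smith paid $3.50 for the apples.', 'He thanked Mrs. Jones.']
```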
Q6. Expanding “5kg” to “five kilograms” during normalization is an example of:
A. Unit expansion
B. Lemmatization
C. Semantic chunking
D. Syntactic pruning
Answer: A
Explanation:
Expanding unit abbreviations into their full written forms is a standard NSW normalization step.
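As an illustrative sketch (the UNITS table and expand_units helper below are invented for this example), a first pass at unit expansion can be a regex over number-unit pairs; a full NSW pipeline would also verbalize the digits themselves ("5" → "five").

```python
import re

# Illustrative unit-abbreviation table (hypothetical, not a library API).
UNITS = {"kg": "kilograms", "km": "kilometers",
         "cm": "centimeters", "lb": "pounds"}

def expand_units(text: str) -> str:
    # Match a number, optional space, and a known unit abbreviation.
    pattern = re.compile(r"(\d+(?:\.\d+)?)\s*(" + "|".join(UNITS) + r")\b")
    return pattern.sub(lambda m: m.group(1) + " " + UNITS[m.group(2)], text)

print(expand_units("She ran 5km and lifted 20 kg."))
# -> She ran 5 kilometers and lifted 20 kilograms.
```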
Q7. Which feature is most useful for disambiguating a homograph such as “lead”?
A. Number of characters
B. Frequency of the word in the corpus
C. POS tags of surrounding words
D. Length of the sentence
Answer: C
Explanation:
The syntactic roles of the surrounding words strongly indicate the intended meaning (“lead pipe” vs. “lead the team”).
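A minimal sketch with NLTK's off-the-shelf POS tagger shows the surrounding context driving different tags for the same surface form "lead" (assumes the tokenizer and tagger models have been downloaded):

```python
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
# newer NLTK releases may need the "punkt_tab" /
# "averaged_perceptron_tagger_eng" variants of these models

for sent in ["He hit it with a lead pipe.", "She will lead the team."]:
    print(nltk.pos_tag(nltk.word_tokenize(sent)))
# "lead" typically comes out as a noun (NN) in the first sentence
# and as a verb (VB) in the second, cued by the neighboring words.
```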
Q8. Why is mapping variant surface forms (e.g., “gr8” and “great”) to one standard form important?
A. It improves semantic equivalence in downstream tasks
B. It removes punctuation automatically
C. It increases dataset size
D. It is required by all tokenizers
Answer: A
Explanation:
Different surface forms can represent the same concept; normalization avoids semantic inconsistency.
Q9. Which punctuation mark is the most ambiguous signal of a sentence boundary?
A. Semicolon
B. Exclamation mark
C. Period
D. Question mark
Answer: C
Explanation:
Periods also appear in decimals, abbreviations, URLs, and titles, which makes them ambiguous as boundary markers.
Q10. Which of the following abbreviations is the most ambiguous without context?
A. “bk”
B. “pls”
C. “rt”
D. “mtg”
Answer: C
Explanation:
“rt” may mean “retweet”, “right”, or “route” depending on the domain, so it needs contextual normalization.
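A purely hypothetical sketch of such contextual normalization: pick the expansion of "rt" whose cue words best overlap the sentence. The CUES table and expand_rt helper are invented for illustration; a real system would learn these associations from data.

```python
# Hypothetical cue words for each reading of "rt".
CUES = {
    "retweet": {"tweet", "follow", "hashtag"},
    "right":   {"turn", "left", "wrong"},
    "route":   {"highway", "road", "drive"},
}

def expand_rt(sentence: str) -> str:
    words = set(sentence.lower().split())
    # Pick the reading whose cue words overlap the sentence the most.
    return max(CUES, key=lambda sense: len(words & CUES[sense]))

print(expand_rt("pls rt this tweet"))     # -> retweet
print(expand_rt("turn rt at the light"))  # -> right
```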
