Corpora & Corpus Analysis in NLP – HOT MCQs with Answers
Q1. In NLP, what is a corpus?
A. A collection of grammar rules
B. A structured collection of real-world text
C. A table of POS tags
D. A set of model predictions
Answer: B
Explanation:
A corpus is a large, structured collection of real-world text or speech used for training NLP models and for linguistic research.
Q2. What is the purpose of corpus annotation?
A. Compress text
B. Add linguistic labels (POS, NER, syntax)
C. Remove noisy samples
D. Convert text into images
Answer: B
Explanation:
Annotation enriches raw text by adding labels such as POS tags, named entities, and parse trees, making supervised NLP possible.
What is Corpus Annotation?
Corpus annotation refers to the process of adding structured, linguistic, or semantic information to raw text so that it becomes useful for NLP tasks, research, and machine learning.
In simple terms:
Corpus annotation = labeling + enriching a text corpus with extra information beyond the raw words.
Types of corpus annotation, with examples (a code sketch of the POS-tagging case follows this list):
- Lemmatization / morphological annotation: cats → cat (root), plural
- POS tagging: He/PRP is/VBZ running/VBG
- Syntactic parsing: (S (NP He) (VP is (VP running)))
- Semantic Role Labeling (SRL): [ARG0 John] [V bought] [ARG1 a car]
- Speech-act annotation: "Can you pass the salt?" → Speech act: REQUEST
- Coreference resolution: John went home. He cooked dinner. "He" → John
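To make the POS-tagging example concrete, here is a minimal sketch using NLTK (an assumption, since the post names no library at this point; the tokenizer and tagger models must be downloaded once, and resource names vary slightly across NLTK versions):

```python
import nltk

# One-time model downloads (names as in classic NLTK releases).
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("He is running")
print(nltk.pos_tag(tokens))
# Expected: [('He', 'PRP'), ('is', 'VBZ'), ('running', 'VBG')]
```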
Q3. Which corpus is widely used to pre-train large language models?
A. Brown Corpus
B. LOB Corpus
C. Common Crawl
D. Gutenberg Corpus
Answer: C
Explanation:
Common Crawl is a massive web-scraped dataset used to train large-scale transformer models.
Q4. What does the Type-Token Ratio (TTR) measure?
A. Word ambiguity
B. Lexical richness
C. Word order
D. Syntactic complexity
Answer: B
Explanation:
TTR indicates vocabulary diversity by comparing the number of unique words to the total words.
What is Type-Token Ratio?
TTR (Type–Token Ratio) is a measure of lexical diversity in a text. It tells you how rich or varied the vocabulary is.
Type–Token Ratio = (Number of unique words) / (Total number of words)
- Types = unique words
- Tokens = total words (including repetitions)
Text: “NLP needs data and NLP needs corpora.”
- Tokens = 7 (NLP, needs, data, and, NLP, needs, corpora)
- Types = 5 (NLP, needs, data, and, corpora)
TTR = 5 / 7 ≈ 0.71
What does Type-Token Ratio (TTR) indicate?
A high TTR (close to 1) = the text uses many unique words [high lexical diversity (diverse vocabulary)]
A low TTR (close to 0) = the text repeats many words [simple or redundant text (repetitive vocabulary)]
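A minimal Python sketch of the calculation above, using naive whitespace tokenization (a real pipeline would use a proper tokenizer):

```python
def type_token_ratio(text: str) -> float:
    # Naive tokenization: split on whitespace, strip punctuation, lowercase.
    tokens = [w.strip(".,!?;:").lower() for w in text.split()]
    return len(set(tokens)) / len(tokens)

print(type_token_ratio("NLP needs data and NLP needs corpora."))
# 5 types / 7 tokens ≈ 0.71
```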
Q5. Which of the following is a dedicated corpus-analysis (concordance) tool?
A. spaCy
B. AntConc
C. TensorFlow
D. NLTK Chunker
Answer: B
Explanation:
AntConc is a popular corpus linguistics tool for KWIC, concordance, and frequency analysis.
AntConc is a free concordance program for text analysis and linguistic research, used for data-driven analysis of texts and keywords.
Q6. What is the defining property of a balanced corpus?
A. Has equal-length documents
B. Represents genres proportionally
C. Contains rare words only
D. Excludes spoken text
Answer: B
Explanation:
Balanced corpora maintain proportional coverage of domains, genres, and registers: a well-proportioned, representative collection of texts in which no genre, topic, or style dominates unfairly.
What is a balanced corpus?
A balanced corpus is a collection of texts that aims to be representative of a language by including a wide range of text categories in proportions that reflect their usage. It includes various genres, domains, and styles of writing or speech to ensure no single type of text dominates the collection, making it a more reliable resource for linguistic analysis.
Q7. What is lemmatization?
A. Sentence splitting
B. Removing punctuation
C. Normalizing words to root form
D. Bigram extraction
Answer: C
Explanation:
Lemmatization maps inflected forms to dictionary roots using context (run, running → run).
What is lemmatization?
Lemmatization is the process of converting an inflected or derived word to its canonical (dictionary) form, called a lemma, using vocabulary knowledge and morphological analysis.
Example:

| Word      | Lemma   |
|-----------|---------|
| Caught    | Catch   |
| Better    | Good    |
| Mice      | Mouse   |
| Was       | Be      |
| Satisfies | Satisfy |
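A quick sketch with spaCy, which lemmatizes in context (assumes the en_core_web_sm model is installed via python -m spacy download en_core_web_sm; exact lemmas can vary with the model version, and a word like "better" depends on the POS the model assigns):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("The mice were caught because the trap satisfies its purpose.")
for token in doc:
    print(f"{token.text:>10} -> {token.lemma_}")
# e.g. mice -> mouse, were -> be, caught -> catch, satisfies -> satisfy
```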
Q8. Which dataset is a standard benchmark for Named Entity Recognition?
A. WikiText-103
B. SQuAD
C. CoNLL-2003
D. IMDb Reviews
Answer: C
Explanation:
CoNLL-2003 includes annotations for PER, ORG, LOC, MISC and is widely used for NER tasks.
What is a dataset?
A dataset is an organized set of related data values, typically arranged in rows and columns or in structured formats, that is collected for analysis, research, or computational tasks.
What is Named Entity Recognition?
Named Entity Recognition (NER) is the process of detecting and categorizing predefined types of information—such as people, locations, organizations, dates, quantities, and other proper nouns—in unstructured text. It assigns each entity mention to a specific semantic class.
Example:
“Apple released the iPhone 15 in California on 12 September 2025.”
NER output:
- Apple → Organization
- iPhone 15 → Product
- California → Location
- 12 September 2025 → Date
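A minimal sketch running this sentence through spaCy's pretrained NER (again assuming en_core_web_sm; note that spaCy's label set differs slightly from the generic names above, e.g. GPE for locations, and whether "iPhone 15" is detected is model-dependent):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple released the iPhone 15 in California on 12 September 2025.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# Typical output: Apple -> ORG, California -> GPE, 12 September 2025 -> DATE
```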
Q9. According to Zipf's law, how is a word's frequency related to its rank?
A. All words appear uniformly
B. Frequency ∝ 1 / rank
C. Bigram frequency is constant
D. Word length grows exponentially
Answer: B
Explanation:
Zipf’s law states that word frequency decreases inversely with rank in the frequency list.
Definition of Zipf's law:
Zipf’s Law states that in a sufficiently large corpus of natural language, the frequency of any word is inversely proportional to its rank in the frequency list.
Formally: f(r) ∝ 1/r, where f(r) is the frequency of the word and r is its rank in the frequency list (1st most frequent, 2nd most frequent, etc.).
If you list all the words in a language by how often they appear:
- The most frequent word appears twice as often as the 2nd most frequent
- Three times as often as the 3rd most frequent
- Four times as often as the 4th most frequent, and so on
So, a few words appear very frequently, while most words appear rarely.
Example:
In English, the word “the” is extremely common, while words like “astronomy” or “serendipity” appear rarely.
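A short sketch for checking Zipf's law on any plain-text file ("corpus.txt" is a placeholder path; if the law holds, freq × rank should stay roughly constant across ranks):

```python
from collections import Counter

# "corpus.txt" is a placeholder; any large plain-text file works.
with open("corpus.txt", encoding="utf-8") as f:
    words = f.read().lower().split()

counts = Counter(words)
for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    # Under Zipf's law, freq * rank is roughly the same for every row.
    print(f"{rank:>2}  {word:<12} freq={freq:<8} freq*rank={freq * rank}")
```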
Q10. According to Heaps' law, how does vocabulary size change as a corpus grows?
A. Remains constant
B. Grows linearly
C. Grows sublinearly as corpus expands
D. Shrinks with more data
Answer: C
Explanation:
Heaps’ Law states that vocabulary increases at a decreasing rate as corpus size grows.
Definition of Heaps' law:
Heaps’ Law states that the number of unique words (vocabulary size) in a corpus increases with the size of the corpus, but at a decreasing rate. In other words, the more text you add, the more new unique words you get — but the rate of new words keeps slowing down.
Heaps’ Law is usually written as:
V(N) = K · N^β
where N is the total number of words (tokens), V(N) is the number of unique words (types), K is a constant (typically between 10 and 100, depending on the corpus), and β is an exponent (usually between 0.4 and 0.6).
This means that as a corpus grows, the vocabulary increases, but at a slower rate, with β typically around 0.5. The law is useful for predicting vocabulary growth and has applications in areas like search-engine indexing and content personalization.
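A companion sketch that makes the sublinear growth visible by tracking vocabulary size while streaming through a file ("corpus.txt" is again a placeholder path):

```python
# Track how the number of unique words (types) grows with tokens read.
vocab = set()
token_count = 0

with open("corpus.txt", encoding="utf-8") as f:  # placeholder path
    for line in f:
        for word in line.lower().split():
            token_count += 1
            vocab.add(word)
            # Report at powers of ten to expose the slowing growth rate.
            if token_count in (10**3, 10**4, 10**5, 10**6):
                print(f"N={token_count:<8}  V(N)={len(vocab)}")
```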