Corpora & Corpus Analysis in NLP – HOT MCQs with Answers

1. What is a corpus in NLP?

A. A collection of grammar rules
B. A structured collection of real-world text
C. A table of POS tags
D. A set of model predictions

Answer: B
Explanation:

A corpus is a large, structured collection of real-world text or speech used for training NLP models and for linguistic research.

2. What is the purpose of corpus annotation?

A. Compress text
B. Add linguistic labels (POS, NER, syntax)
C. Remove noisy samples
D. Convert text into images

Answer: B
Explanation:

Annotation enriches raw text by adding labels such as POS tags, entities, parse trees—making supervised NLP possible.

What is Corpus Annotation?

Corpus annotation refers to the process of adding structured, linguistic, or semantic information to raw text so that it becomes useful for NLP tasks, research, and machine learning.

In simple terms:

Corpus annotation = labeling + enriching a text corpus with extra information beyond the raw words.

Types of corpus annotation, with examples

1. Morphological Annotation: Labels each word with its root/stem, prefixes and suffixes, and features such as gender, number, tense, and case:

cats → cat (root), plural


2. Part-of-Speech (POS) Annotation: Assigning word classes:

He/PRP is/VBZ running/VBG


3. Syntactic Annotation: Building parse trees or dependency relations (the following example is in Penn Treebank style):

(S (NP He) (VP is (VP running)))


4. Semantic Annotation: Adding meaning-level labels: Word senses; Semantic roles: Agent, Patient, Instrument; Named Entities: PERSON, ORG, LOC

Named Entity Recognition (NER) example: Apple/ORG announced the iPhone 15 in California/LOC
Semantic Role Labeling (SRL) example: [ARG0 John] [V bought] [ARG1 a car]


5. Pragmatic Annotation: Capturing speaker intentions, sarcasm, and inference: speech acts, dialogue acts, coherence relations

"Can you pass the salt?": Speech act: REQUEST


6. Discourse Annotation: Links across sentences: coreference, topic segmentation, Rhetorical Structure Theory (RST)

John went home. He cooked dinner. “He” → John 
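Several of these annotation layers can be produced automatically with off-the-shelf NLP libraries. Below is a minimal sketch using spaCy and its small English model (an assumption: the model must first be installed with "python -m spacy download en_core_web_sm"); it prints POS annotation and lemmas for the earlier example:

import spacy

# Load spaCy's small English pipeline (assumed installed, see above)
nlp = spacy.load("en_core_web_sm")
doc = nlp("He is running")

# Print each token with its Penn Treebank tag and lemma
for token in doc:
    print(f"{token.text}/{token.tag_}  (lemma: {token.lemma_})")

# Expected output (roughly): He/PRP is/VBZ running/VBG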

3. Which corpus is widely used for training modern LLMs?

A. Brown Corpus
B. LOB Corpus
C. Common Crawl
D. Gutenberg Corpus

Answer: C
Explanation:

Common Crawl is a massive web-scraped dataset used to train large-scale transformer models.

4. Type-Token Ratio (TTR) is used to measure:

A. Word ambiguity
B. Lexical richness
C. Word order
D. Syntactic complexity

Answer: B
Explanation:

TTR indicates vocabulary diversity by comparing the number of unique words to the total words.

What is Type-Token Ratio?

TTR (Type–Token Ratio) is a measure of lexical diversity in a text. It tells you how rich or varied the vocabulary is.

Type–Token Ratio = (Number of unique words) / (Total number of words)

  • Types = unique words

  • Tokens = total words (including repetitions)

Example

Text: “NLP needs data and NLP needs corpora.”

  • Tokens = 7 (NLP, needs, data, and, NLP, needs, corpora)

  • Types = 5 (NLP, needs, data, and, corpora)

TTR = 5 / 7 ≈ 0.71

What does Type-Token Ratio (TTR) indicate?

A high TTR (close to 1) means the text uses many unique words: high lexical diversity (a varied vocabulary).
A low TTR (close to 0) means the text repeats many words: simple or redundant text (a repetitive vocabulary).
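As a quick check of the arithmetic above, here is a minimal pure-Python sketch that computes TTR for the example sentence (tokenizing naively by whitespace, lowercasing, and stripping the final period; a real analysis would use a proper tokenizer):

# Naive normalization: lowercase, drop the period, split on whitespace
text = "NLP needs data and NLP needs corpora."
tokens = text.lower().replace(".", "").split()

types = set(tokens)          # unique words
ttr = len(types) / len(tokens)

print(len(tokens), len(types), round(ttr, 2))  # 7 5 0.71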

5. Which tool is commonly used for concordance and KWIC?

A. spaCy
B. AntConc
C. TensorFlow
D. NLTK Chunker

Answer: B
Explanation:

AntConc is a popular corpus linguistics tool for KWIC, concordance, and frequency analysis.

AntConc is a free concordance program for text analysis and linguistic research, used for data-driven analysis of texts and keywords.
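For a programmatic alternative to AntConc, NLTK offers a simple concordance (KWIC) view. A minimal sketch, assuming NLTK is installed and using a toy token list as a stand-in corpus:

from nltk.text import Text

# Toy corpus; a real study would load a full corpus instead
words = ("NLP needs data and NLP needs corpora because "
         "corpora give NLP models real examples").split()

# Text.concordance prints each hit with its surrounding context (KWIC)
Text(words).concordance("NLP")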

6. A balanced corpus is one that:

A. Has equal length documents
B. Represents genres proportionally
C. Contains rare words only
D. Excludes spoken text

Answer: B
Explanation:

Balanced corpora maintain proportional coverage of domains, genres, and registers. A balanced corpus is a well-proportioned, representative collection of texts in which no genre, topic, or style dominates unfairly.

What is a balanced corpus?

A balanced corpus is a collection of texts that aims to be representative of a language by including a wide range of text categories in proportions that reflect their usage. It includes various genres, domains, and styles of writing or speech to ensure no single type of text dominates the collection, making it a more reliable resource for linguistic analysis. 

7. What does lemmatization achieve in corpus preprocessing?

A. Sentence splitting
B. Removing punctuation
C. Normalizing words to root form
D. Bigram extraction

Answer: C
Explanation:

Lemmatization maps inflected forms to dictionary roots using context (running, ran → run).

What is lemmatization?

Lemmatization is the process of converting an inflected or derived word to its canonical (dictionary) form, called a lemma, using vocabulary knowledge and morphological analysis.

Example:

Word → Lemma

caught → catch
better → good
mice → mouse
was → be
satisfies → satisfy
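A minimal sketch of lemmatization with NLTK's WordNetLemmatizer (an assumption: the WordNet data must first be fetched with nltk.download('wordnet')); note that it needs a POS hint to handle verbs and adjectives correctly:

from nltk.stem import WordNetLemmatizer

# Requires nltk.download('wordnet') once beforehand
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("mice"))             # mouse (default POS: noun)
print(lemmatizer.lemmatize("caught", pos="v"))  # catch
print(lemmatizer.lemmatize("better", pos="a"))  # good
print(lemmatizer.lemmatize("was", pos="v"))     # be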

8. Which dataset is a standard benchmark for Named Entity Recognition?

A. WikiText-103
B. SQuAD
C. CoNLL-2003
D. IMDb Reviews

Answer: C
Explanation:

CoNLL-2003 includes annotations for PER, ORG, LOC, MISC and is widely used for NER tasks.
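For experiments, the benchmark can be pulled with the Hugging Face datasets library. A minimal sketch, assuming the library is installed (newer versions of the loader may additionally require trust_remote_code=True):

from datasets import load_dataset

# Downloads and caches the CoNLL-2003 train/validation/test splits
conll = load_dataset("conll2003")

example = conll["train"][0]
print(example["tokens"])    # the words of the first training sentence
print(example["ner_tags"])  # integer-coded BIO labels for PER/ORG/LOC/MISC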

What is a dataset?

A dataset is an organized set of related data values, typically arranged in rows and columns or in structured formats, that is collected for analysis, research, or computational tasks.

What is Named Entity Recognition?

Named Entity Recognition (NER) is the process of detecting and categorizing predefined types of information—such as people, locations, organizations, dates, quantities, and other proper nouns—in unstructured text. It assigns each entity mention to a specific semantic class.

Example:

“Apple released the iPhone 15 in California on 12 September 2025.”

NER output:

  • Apple → Organization

  • iPhone 15 → Product

  • California → Location

  • 12 September 2025 → Date
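Running the sentence through an off-the-shelf tagger reproduces labels of this kind. A minimal sketch with spaCy (same en_core_web_sm assumption as earlier; label inventories vary by model, e.g. spaCy uses GPE rather than Location for places like California):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple released the iPhone 15 in California on 12 September 2025.")

# Print each detected entity span with its predicted label
for ent in doc.ents:
    print(ent.text, "->", ent.label_)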

9. Zipf’s Law states that:

A. All words appear uniformly
B. Frequency ∝ 1 / rank
C. Bigram frequency is constant
D. Word length grows exponentially

Answer: B
Explanation:

Zipf’s law states that word frequency decreases inversely with rank in the frequency list.

Definition of Zipf's law:

Zipf’s Law states that in a sufficiently large corpus of natural language, the frequency of any word is inversely proportional to its rank in the frequency list.

f(r) ∝ 1 / r

Where f(r) denotes the frequency of the word and r its rank in the frequency list (1st most frequent, 2nd most frequent, etc.).

To understand:

If you list all the words in a language by how often they appear:

  • The most frequent word appears about twice as often as the 2nd most frequent
  • About three times as often as the 3rd most frequent
  • About four times as often as the 4th most frequent, and so on

So, a few words appear very frequently, while most words appear rarely.

Example:

In English, the word “the” is extremely common, while words like “astronomy” or “serendipity” appear rarely.
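Zipf's law is easy to check empirically: count word frequencies, sort by rank, and inspect frequency × rank, which stays roughly constant when the law holds. A minimal sketch (corpus.txt is a hypothetical placeholder for any large plain-text file):

from collections import Counter

# corpus.txt is a hypothetical stand-in for a large plain-text corpus
tokens = open("corpus.txt", encoding="utf-8").read().lower().split()
counts = Counter(tokens)

# If Zipf's law holds, freq * rank is roughly constant for top words
for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    print(rank, word, freq, freq * rank)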

10. According to Heaps’ Law, vocabulary size:

A. Remains constant
B. Grows linearly
C. Grows sublinearly as corpus expands
D. Shrinks with more data

Answer: C
Explanation:

Heaps’ Law states that vocabulary increases at a decreasing rate as corpus size grows.

Definition of Heap's law:

Heaps’ Law states that the number of unique words (vocabulary size) in a corpus increases with the size of the corpus, but at a decreasing rate. In other words, the more text you add, the more new unique words you get — but the rate of new words keeps slowing down.

Heaps’ Law is usually written as:

V(N) = K · N^β

Where N is the total number of words (tokens), V(N) is the number of unique words (types), K is a constant (typically between 10 and 100, depending on the corpus), and β is an exponent (usually between 0.4 and 0.6).

This means that as a corpus grows, the vocabulary increases, but at a slower rate, with the exponent β typically around 0.5. This law is useful for predicting vocabulary growth and has applications in areas like search engine indexing and content personalization.
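The sublinear growth is easy to see numerically by plugging increasing N into the formula. A minimal sketch with illustrative constants (K = 30 and β = 0.5 are assumptions chosen from the typical ranges quoted above):

# K and beta are illustrative values within the typical ranges above
K, beta = 30, 0.5

for n in [1_000, 10_000, 100_000, 1_000_000]:
    v = K * n ** beta
    # Tokens grow 10x per step, but vocabulary only ~3.16x (10 ** 0.5)
    print(f"N = {n:>9,}  ->  V(N) = {v:,.0f}")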