Probabilistic Context Free Grammar (PCFG)  How to calculate probabilities for production rules from Treebank?
PCFG Solved Exercise
Question:
Consider the
following 3 trees as a Treebank and derive a Probabilistic Context Free Grammar(PCFG) from this Treebank. [Derive the probabilities of production rules]
Solution:
Let us assume that
the given Treebank is the training corpus. It consists of common start symbol S,
set of nonterminals, set of terminal symbols and production rules. Each rule
derives a nonterminal to another set of nonterminals or combination of
terminals and nonterminals. PCFG will consist of these rules with the
probability for each rule.
The probability
estimate for a rule A → s can be calculated using Maximum Likelihood estimate. Here,
s is a sequence of terminals and nonterminals. It is from (T U V)*, infinite
set of strings. [Refer here for formal definition of CFG]
The probability
estimate of a production rule can be written as,
Let us use this equation to derive the probabilities;
1. If you look at
the Treebank, the common start symbol S is derived to NP and VP. This is the
rule S → NP VP.
2. V1 and SBAR
derived from VP. Hence, VP → V1 SBAR.
3. Next rule, SBAR → COMP S.
4. Let us choose one rule that has only the terminal symbol on its RHS. For example, the rule, NP → John is one such rule (We refer it as lexicon).
The other probabilities
can be calculated as shown above. As a result, we get the grammar and the
corresponding probabilities for the PCFG as follows;
Rule

Probability

S → NP VP
SBAR → COMP S
COMP → that
VP → V1 SBAR  VP
ADVP  V2
NP → Sally
NP → JohnBillFredJeff
V1 → saiddeclaredpronounced
V2 → snoredranswam
ADVP → loudly  quickly
elegantly

1
1
1
1/3
1/3
1/6
1/3
1/3
1/3

**********
Related links:
 Go to Probabilistic Context Free Grammar (PCFG) Home
 Go to Natural Language Processing (NLP) home
 Go to NLP Solved Exercise page
 Go to Context Free Grammar (CFG) Formal Definition page