Data from the Penn Treebank combined parses are annotated in the
following manner (as seen here in wsj-0001.mrg):
( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))
To process this file, enter the following commands at the Python prompt:
$> from mrg_utils import *
$> doc = Document('-TREEBANK-PATH/combined/wsj/00/wsj-0001.mrg')
This creates a Document object that contains an ordered list of
sentences (doc.sentences), an ordered list of syntactic heads
(doc.heads, arrived at using Collins-style parses with modifications
first included by Marneffe et al. (2005)), an ordered list of all the
words within the document (doc.allWords), and all lexical heads
(doc.termHeads).
Each sentence object contains a pointer to the root node object of the
tree (sentence.nodes) as well as a string rendering of the tree
transformed to Python list style (sentence.fullTree).
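
As a rough illustration of what "Python list style" can mean, the sketch
below turns a bracketed tree into nested Python lists. It is a
standalone example and not the code mrg_utils.py uses; the function name
parse_bracketed is hypothetical, and the exact shape of
sentence.fullTree may differ.

```python
import re

def parse_bracketed(text):
    """Parse Penn Treebank bracketed text into nested Python lists.

    Hypothetical helper for illustration; not part of mrg_utils.
    """
    # A token is "(", ")", or any run of characters that is neither
    # whitespace nor a parenthesis (labels and words alike).
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    stack = [[]]  # stack[0] collects the top-level trees
    for tok in tokens:
        if tok == "(":
            new = []
            stack[-1].append(new)  # attach to the current constituent
            stack.append(new)      # and descend into it
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            stack.pop()
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]

tree = parse_bracketed(
    "( (S (NP-SBJ (NNP Pierre) (NNP Vinken)) "
    "(VP (MD will) (VB join)) (. .)) )"
)
sentence = tree[0][0]  # skip the unlabeled outer bracket
print(sentence[0])     # label of the sentence node
print(sentence[1])     # the subject NP as a nested list
```

The unlabeled outer bracket of each .mrg tree shows up as one extra
level of nesting, which is why the example indexes tree[0][0] to reach
the S node.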
There are four kinds of node objects. The simplest are
NonTerminalNode and TerminalNode. Both carry part-of-speech
information (node.pos), string information (node.string), the sentence
index within the document (node.index), and a Gorn address (node.gorn).
NonTerminalNode objects, along with RootNode objects, keep an ordered
list of the nodes they directly govern in node.children. All three of
these node types have the pointer attributes node.oneUp, node.oneRight,
and node.OneLeft, which point to the parent node, the right sister, and
the left sister, respectively.
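
To make the pointer layout concrete, here is a minimal sketch of a node
type with parent and sister links and a Gorn address. SketchNode and
add_child are hypothetical stand-ins, not the mrg_utils classes. The
sketch assumes a 0-based Gorn address stored as a tuple and spells the
left-sister attribute oneLeft; the real module may differ on both
points.

```python
class SketchNode:
    """Hypothetical stand-in for a tree node; not the mrg_utils class."""

    def __init__(self, pos, string=None):
        self.pos = pos          # part-of-speech or phrase label
        self.string = string    # surface word for terminals, else None
        self.children = []      # nodes directly governed, in order
        self.oneUp = None       # parent
        self.oneLeft = None     # left sister
        self.oneRight = None    # right sister
        self.gorn = ()          # path of child indices from the root

    def add_child(self, child):
        """Attach child as the new rightmost daughter and wire its links."""
        child.oneUp = self
        if self.children:
            left = self.children[-1]
            left.oneRight = child
            child.oneLeft = left
        # Assumed convention: parent's address plus a 0-based child index.
        child.gorn = self.gorn + (len(self.children),)
        self.children.append(child)
        return child

def leaves(node):
    """Return the terminal nodes under node, left to right."""
    if not node.children:
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found

# Build the tree top-down so each parent's address exists before its children's.
root = SketchNode("S")
np_node = root.add_child(SketchNode("NP"))
vp_node = root.add_child(SketchNode("VP"))
np_node.add_child(SketchNode("DT", "the"))
board = np_node.add_child(SketchNode("NN", "board"))
vp_node.add_child(SketchNode("VBD", "met"))

print(board.gorn, board.oneLeft.string, board.oneUp.pos)
print([leaf.string for leaf in leaves(root)])
```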
Using this information, you can traverse the Penn Treebank with
considerable ease. Here is an example script that requires the
environment variable PENN_TREEBANK_DIR to point to the canonical
directory of PTB release II on your machine. It uses mrg_utils.py to
quickly compute part-of-speech statistics over the entire corpus.
It outputs the following statistics (abbreviated here):
NN     163935
IN     121903
NNP    114053
DT     101190
JJ      75266
...       ...
#         173
UH        117
SYM        70
LS         64
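
For readers without mrg_utils.py at hand, a tally of this kind can be
approximated with the standard library alone. The sketch below is not
the script described above and is not guaranteed to reproduce its
counts. It makes three assumptions: that a part-of-speech tag is the
label of any bracket containing exactly one word, that empty elements
tagged -NONE- should be skipped, and that the .mrg files live under a
combined/ subdirectory of PENN_TREEBANK_DIR (taken from the path shown
earlier). The function names are hypothetical.

```python
import os
import re
from collections import Counter

# Matches preterminals such as "(NNP Pierre)": a tag, whitespace, one word.
# Phrase labels such as "(NP-SBJ (" do not match, since "(" cannot be a word.
PRETERMINAL = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")

def count_pos(text, skip=("-NONE-",)):
    """Count part-of-speech tags in bracketed treebank text."""
    counts = Counter()
    for tag, _word in PRETERMINAL.findall(text):
        if tag not in skip:  # assumption: empty elements are not counted
            counts[tag] += 1
    return counts

def count_pos_in_dir(root_dir):
    """Sum tag counts over every .mrg file below root_dir."""
    total = Counter()
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            if name.endswith(".mrg"):
                path = os.path.join(dirpath, name)
                with open(path, encoding="latin-1") as handle:
                    total.update(count_pos(handle.read()))
    return total

if __name__ == "__main__":
    treebank = os.environ.get("PENN_TREEBANK_DIR")
    if treebank:
        # Assumed layout: parsed files under <PENN_TREEBANK_DIR>/combined.
        totals = count_pos_in_dir(os.path.join(treebank, "combined"))
        for tag, n in totals.most_common():
            print(tag, n)
```

Calling count_pos on a single file's contents is a quick way to check
the tag set before running over the whole corpus.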