MRG Utils

An Open-Source Tool for Penn Treebank-Style Combined Parses

Overview

Data from the Penn Treebank combined parses are annotated in the following manner (as seen here in wsj-0001.mrg):

( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))


To process this file, enter the following commands in a Python session:


$> from mrg_utils import *
$> doc = Document('-TREEBANK-PATH/combined/wsj/00/wsj-0001.mrg')

This creates a Document object that contains an ordered list of sentences (doc.sentences), an ordered list of syntactic heads (doc.heads, arrived at using Collins-style parses with modifications first included by Marneffe et al. (2005)), an ordered list of all the words in the document (doc.allWords), and all lexical heads (doc.termHeads). Each sentence object contains a pointer to the root node object of its tree (sentence.nodes) as well as a string rendering of the tree in Python list style (sentence.fullTree).
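
As a rough illustration of what a Python list-style rendering of a bracketed tree might look like, the sketch below converts a Penn Treebank-style parse into nested lists. It does not use mrg_utils, and it is not necessarily how the fullTree string is produced; the function name parse_bracketed is hypothetical and the sample sentence is shortened from the example above.

```python
import re

def parse_bracketed(text):
    """Convert a PTB-style bracketed parse into nested Python lists.

    Hypothetical helper for illustration; not part of mrg_utils.
    Assumes the parentheses in `text` are balanced.
    """
    # A token is an opening paren, a closing paren, or a run of other non-space characters.
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    stack = [[]]                      # stack[0] collects the top-level trees
    for tok in tokens:
        if tok == "(":
            stack.append([])          # open a new constituent
        elif tok == ")":
            done = stack.pop()        # close it and attach it to its parent
            stack[-1].append(done)
        else:
            stack[-1].append(tok)     # a category label or a word
    return stack[0]

sample = "( (S (NP-SBJ (NNP Pierre) (NNP Vinken) ) (VP (MD will) (VP (VB join) ) ) (. .) ))"
trees = parse_bracketed(sample)
sentence = trees[0][0]                # trees[0] is the unlabeled outer wrapper
print(sentence)
```

Each constituent becomes a list whose first element is its label, so the subject phrase here comes out as ['NP-SBJ', ['NNP', 'Pierre'], ['NNP', 'Vinken']].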

There are four kinds of node objects. The simplest are NonTerminalNode and TerminalNode. Both carry part-of-speech information (node.pos), string information (node.string), the sentence index within the document (node.index), and a Gorn address (node.gorn). NonTerminalNode objects, along with RootNode objects, keep an ordered list of the nodes they directly govern in node.children. All three of these node types have the pointer attributes node.oneUp, node.oneRight, and node.OneLeft, which point to the parent node, the right sister, and the left sister, respectively.
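
To make the pointer and address attributes concrete, here is a minimal sketch of a linked tree built from a nested list. The class SketchNode and the function build are hypothetical stand-ins, not the mrg_utils classes. The sketch also makes several assumptions of its own: Gorn addresses are zero-based tuples with the root at (), a non-terminal's string is the space-joined yield of its children, and both sister pointers are spelled in lower camel case (oneLeft, oneRight).

```python
class SketchNode:
    """Hypothetical stand-in for a tree node; not an mrg_utils class."""

    def __init__(self, pos, string=None):
        self.pos = pos                # category or part-of-speech label
        self.string = string          # word for terminals, yield for non-terminals
        self.children = []            # ordered daughters (empty for terminals)
        self.gorn = ()                # Gorn address, assumed zero-based here
        self.oneUp = self.oneLeft = self.oneRight = None

def build(tree, parent=None, gorn=()):
    """Build linked nodes from a nested list such as ['NP', ['DT', 'the'], ['NN', 'board']]."""
    label, rest = tree[0], tree[1:]
    if len(rest) == 1 and isinstance(rest[0], str):
        node = SketchNode(label, rest[0])             # terminal: label plus word
    else:
        node = SketchNode(label)
        for i, sub in enumerate(rest):
            node.children.append(build(sub, node, gorn + (i,)))
        # Link each pair of adjacent daughters as left and right sisters.
        for left, right in zip(node.children, node.children[1:]):
            left.oneRight, right.oneLeft = right, left
        node.string = " ".join(child.string for child in node.children)
    node.oneUp, node.gorn = parent, gorn
    return node

phrase = build(["NP", ["DT", "the"], ["NN", "board"]])
the, board = phrase.children
print(board.gorn, board.oneLeft.string, board.oneUp.pos)   # (1,) the NP
```

With pointers like these, moving from any node to its parent or sisters is a single attribute lookup, and the Gorn address records the path from the root without any search.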

Using this information, you can traverse the Penn Treebank with considerable ease. Here is an example script that requires the environment variable PENN_TREEBANK_DIR to point to the canonical directory of PTB release II on your machine. It uses mrg_utils.py to quickly compute part-of-speech statistics over the entire corpus.

It outputs the following statistics (abbreviated here):

NN  163935
IN  121903
NNP 114053
DT  101190
JJ  75266
... ...
#   173
UH  117
SYM 70
LS  64
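
For readers who want to try a count of this kind without mrg_utils, the following self-contained sketch tallies part-of-speech tags with a regular expression. It is only an approximation of the idea: the names count_pos_tags and count_corpus are hypothetical, it counts every preterminal label it finds (including any trace or empty-element tags), and it assumes the corpus files end in .mrg. Its totals are therefore not guaranteed to match the figures above.

```python
import os
import re
from collections import Counter

# A preterminal looks like "(TAG word)": a label followed by one token and a closing paren.
PRETERMINAL = re.compile(r"\(([^\s()]+)\s+[^\s()]+\)")

def count_pos_tags(text, counts=None):
    """Add the POS tags found in `text` to `counts` (hypothetical helper)."""
    counts = Counter() if counts is None else counts
    counts.update(PRETERMINAL.findall(text))
    return counts

def count_corpus(root_dir):
    """Tally tags over every .mrg file found below `root_dir`."""
    counts = Counter()
    for dirpath, _dirs, files in os.walk(root_dir):
        for name in files:
            if name.endswith(".mrg"):
                path = os.path.join(dirpath, name)
                with open(path, encoding="latin-1") as handle:
                    count_pos_tags(handle.read(), counts)
    return counts

# Small in-memory demonstration that needs no corpus.
sample = "( (S (NP-SBJ (DT the) (NN board) ) (VP (VBD met) (NP (DT the) (NN director) )) (. .) ))"
demo = count_pos_tags(sample)

# Full run only if the environment variable described above is set.
root = os.environ.get("PENN_TREEBANK_DIR")
if root:
    for tag, n in count_corpus(root).most_common():
        print(tag, n)
```

A regular expression is enough here because only the innermost brackets carry words; anything that needs phrase structure, heads, or sister relations calls for a real tree traversal.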
