Lesson 18

Reinventing the wheel

The parsing of mark-up languages like XML and HTML is fairly well served by CPAN modules. However, for parsing a hand rolled data structure, you will have to roll your own parser in some way. For this, you may consider using regular expressions, and this will generally work out fine as long as you know what you're doing, and if you are sure that the thing you hope to parse is actually parsable with regexes. For not all things are. Although Perl's regexes are too irregular (in the mathematical sense) to be regarded as real 'regular' expressions, they are still too regular to be used to create (irregular) grammars. Irregular grammars are needed to describe nested languages like XML, hence any attempt to write a regex that matches any conceivable XML document is doomed to fail. For the same reasons, using regexes to parse HTML, or (gulp) Perl code is at best a dirty hack for a few special cases, and at worst, impossible. So don't do that.

Grammars, of which regexes are the most basic sort, are what you use to describe, parse, and grasp the meaning of a language. The application of a grammar to a string will allow you to determine whether that string belongs to a given language or not. For regexes, what /^aa*$/ actually does is test whether $_ belongs to the language described by the grammar aa*, which is a rather simple language consisting of the 'words' a, aa, aaa, aaaa, etc. More complicated languages need more complicated grammars, and certain complexities in a language require qualitatively more complex grammars to parse them: nesting is just such a complexity.

There are several irregular grammar modules on CPAN (of which HTML::Parser and XML::Parser you've already met), but the one we will be looking at is Parse::RecDescent (parse-recursive-descent). However, before we delve into using the module, it's worth a relatively brief computer science diversion to understand what we mean by regexes, grammars and where their limitations stem from.

Grammars describe whether utterances, such as a DNA sequence of aagttc, belong to a given language, such as the genetic code. Not only do increasingly complex languages require increasingly complex grammars, they also require an increasingly complex machine (i.e. parser program) to make the judgement of whether a string is derivable using the grammar. All grammars consist of four things (concentrate, here comes the science bit, *shudder*):

This will all become immediately relevant when we start using Parse::RecDescent, since the specification of a Parse::RecDescent grammar is just a long list of production rules and symbols. The next science bit is just for 'fun', but well worth remembering. The commonest way of classifying grammars is the four level Chomsky hierarchy. Each of these levels is defined by how many restrictions we put on what the production rules may look like:

Parse::RecDescent is a pure Perl top-down parser that can parse CF languages, and certain CS languages. It's well worth getting your head round it, as Perl 6's regexes bear a startling similarity, and learning Parse::RecDescent will help you beat the rush to become the Perl 6 regex office guru before Perl 6 is even out. Sticking with my usual biology obsessions, we will write a parser to parse and manipulate a DNA sequence. This is actually overkill (as you could probably do it with a regex), but looks nicer.

Parse::RecDescent is actually very simple to use: it simply takes two strings, one a grammar for a language, the other a string to parse and pronounce judgement on. The bones of it look something like:

#!/usr/bin/perl
use strict;
use Parse::RecDescent;
my $grammar = << 'GRAMMAR';
    startrule: expression /^\Z/
    # you can also put comments into the grammar
    # such as to explain that /^\Z/ means 'match the end of the string'
    expression:
        number |
        '(' expression ')' |
    '(' expression operator expression ')'
    number: /\d+/
    operator: /[%*\/+-]/
GRAMMAR
my $p = new Parse::RecDescent( $grammar ) or die "Compile error\n";
while ( 1 )
{
    chomp( $_ = <STDIN> );
    print "Syntax error\n" unless defined $p->startrule( $_ );
}

This grammar (in bold) will only accept numbers, 9 , or correctly nested parentheticals, (9+0), ((5+9)/3). These can be whitespace-padded (the default 'skip' in Parse::RecDescent is /\s*/, although you can modify this manually using a parser directive). The /^\Z/ in the startrule ensures we do not match strings that merely contain correct substrings: it's much like /^a$/ anchoring a regex. We have defined four productions in the grammar (which is just a single-quoted heredoc), startrule, expression, number and operator. Each production has the form:

rule: description

where the description is simply what the rule can be composed of: alternatives are denoted with a vertical bar | as in regexes, and can extend over several lines. The parser will attempt to match the first item in the list in every which way it can, before moving onto the next item (in contrast to many parsers, which look for the longest match). The items in the list can be either literal strings like '(' or regexes like /\d+/ or subrules.

rule:
'literal string, i.e. terminal symbol'
        |
m{regex, i\.e\. a set of terminal symbols}x
        |
subrule_ie_another_nonterminal_symbol

A description can even contain itself:

expression: number | '(' expression ')'

However, it should be noted that:

expression: expression | number

is illegal, since a description cannot contain the rule it describes as its leftmost item: this is because most such grammars are non-terminating (i.e. go into infinite recursion). To enable warnings about problems like this, you can turn on a variety of warnings by putting the following at the top of your script:

$::RD_ERRORS = 1; # kill parser if it encounters an error
$::RD_HINT   = 1; # helpful hints

(note that $::RD_ERRORS is short for the package-explicit $main::RD_ERRORS, which itself is long for our $RD_ERRORS. TIMTOWTDI, and there's more than one way to keep strict happy). Subrules can be quantified much like atoms in regexes: the equivalent of /a+/ (one-or-more) is:

rule: subrule(s)

The equivalent of /a?/ is helpfully just:

spouse: husband(?) | wife(?)

and the equivalent of /a*/ is:

name: forename middlename(s?) surname

The actually parsing takes place after the definition of the grammar:

my $p = new Parse::RecDescent( $grammar ) or die "Compile error\n";
while ( 1 )
{
    chomp( $_ = <STDIN> );
    print "Syntax error\n" unless defined $p->startrule( $_ );
}

This creates a new parser object $p containing a compiled form of the grammar. To check that a string matches a particular rule in the grammar is simply a matter of calling the rule as a method on the parser object. I have helpfully named our start-rule (S) startrule, but this is not compulsory. If the string matches the grammar, the parser will return something, if it fails, it returns undef. This is important: you must use:

if ( defined $p->rule( $string ) ){ print 'OK' }

to check if $string parses correctly.

Parsing a string is well and good, but all we have currently done is determine whether a given string is part of the language defined by the grammar. Usually, we also want to understand the meaning of the string too, i.e. do something based on its contents. To do this, we can attach actions to the productions. In the simplest case, an action looks like this:

use Parse::RecDescent;
my $grammar = q( start: /\w+/ { print "Found words !\n" } );
my $p = new Parse::RecDescent( $grammar ) or die "Compile error\n";
while ( 1 )
{
    chomp( $_ = <STDIN> );
    print "Syntax error\n" unless defined $p->start( $_ );
}

This attaches the action { print "Found words!\n" } to the regex /\w+/. The code can be any bit of Perl you like, and it will be executed if the previous item succeeded. The action executes within the namespace of the parser, so if you plan on manipulating variables in the body of your program from within an action, you must qualify them with a namespace, and share them as package globals with our (not with my):

use strict;
use Parse::RecDescent;
our $matched;
my $grammar = q( start: /\w+/ { $::matched = "It worked" } );
my $p = new Parse::RecDescent( $grammar ) or die "Compile error\n";
while ( 1 )
{
    chomp( $_ = <STDIN> );
    if ( defined $p->start( $_ ) )
        { print "<<< $matched >>>\n"; }
    else{ print "Syntax error\n" }
}

Within the parser namespace, you will find a number of useful variables that you can access within an action. The most useful are @item (and %item) and $return. @item contains the return values of the items in the production:

seq: start codon(s) stop { print "@item\n"; }
seq 1 ARRAY(0x23992cc) 1

$item[0] contains the name of the rule (seq), and the slice @item[ 1 .. $#item ] contains the the values associated with start , codon(s) and stop. This is actually rather fragile (since if you change your code, it may change the position of the item in the list). %item allows access to named subrules: here $item{start} is equivalent to $item[1], but is less likely to break. You can't get named access to un-named items (such as regexes and literals), so it's generally a good idea to create a non-terminal rule for each terminal symbol:

rule: /\d{2}/ subrule { print $item[1] }

is probably better written:

rule : numberregex subrule {print $item{numberregex} }
numberregex :/\d{2}/

So you can access the regex by name rather than by position.

The value of the things available in @item and %item are what was actually matched: by default, either the value of a literal string, the thing matched by a regex, or the last item matched by a subrule. However, you can readily subvert this by using an action to return something different. The simplest and most explicit way of doing this is to set the value of $return in an action:

#!/usr/bin/perl
use strict;
use Parse::RecDescent;
my $grammar = <<'GRAMMAR';
    start: subrule { print "$item{subrule}\n" }
    subrule: name number
        { $return = "name was $item{name} and number was $item{number}" }
    name: /[A-Z]+/i { $return = $item[1] }
    number: /\d+/ { $return = $item[1] }
GRAMMAR
my $p = new Parse::RecDescent( $grammar ) or die "Compile error\n";
while ( 1 )
{
    chomp( $_ = <STDIN> );
    unless ( defined $p->start( $_ ) ){print "Syntax error\n";}
}
Steve 666
name was Steve and number was 666

This allows you to pass items up the rule hierarchy. By no means the last feature of Parse::RecDescent, but the last I'll explain here is the use of directives. Directives, which all look something like <directive> influence how the parser runs. Four directives that are frequently useful are <skip>, <leftop>, <commit> and <uncommit>. <skip> allows you to change what the parser should ignore before a token:

rule: command <skip:'[\s,]*'> list

This will allow the parser to ignore any whitespace or commas in the subrule list (and all its subrules).

The <leftop> directive (and the similar <rightop> directive) allow you to create quick left-associative binary operators and lists. For example, to match a list like:

( Gnetum, Welwitschia, Ephedra )

you could write a rule that looks like

list : '(' /\w+/ (',' /\w+/ )(s?) ')'

or even (don't even think about it):

list :
'(' /\w+/ ')'
| '(' /\w+/ ',' /\w+/ ')'
| '(' /\w+/ ',' /\w+/ ',' /\w+/ )'
| ...

but it's much easier to write

list : '(' <leftop: /\w+/ ',' /\w+/> ')' { print "@{ $item[2] }\n" }

and let the module work it out for you. The <leftop> directive takes three arguments: a left operand, an operator (here the comma) and a right operand, and expands them suitably so it will match a list (or similar) of any length. It returns an anonymous array of the operands matched (here $item[2] ).

The <commit> directive allows you to short-cut through long lists of alternatives for efficiency, by commiting to a particular alternative in a list of productions:

rule: 'hello' <commit> name | 'goodbye' name | 'goodnight' name

This means that if the rule matches the string 'hello', but then fails to match name, the entire rule immediately fails, since there is no point in checking the other productions as they clearly will not match either. The <uncommit> directive allows you to change your mind later on (this example pilfered directly from the POD for the module):

if_statement:
'if' <commit> condition
'then' block <uncommit>
'else' block
|
'if' <commit> condition
'then' block

Here, we commit to matching the entirety of the first production after an 'if'. This is efficient, since we won't match the whole if_statement rule unless we match the condition 'then' block subrules too. However, we should then uncommit ourselves, in case there's no 'else' block: if we don't, the whole rule will fail without trying out the second production.

OK. Here is a simple parser script that checks to see if an RNA sequence of A(denine), G(uanine), C(ytosine) and U(racil: RNA has this instead of the Thymine found in DNA) bases fits a canonical mRNA translatable sequence of a start-codon (three bases spelling AUG), an integer number of codons, followed by a stop-codon (which have bizarre names for historical reasons). Whilst doing this, it helpfully translates the RNA sequence into the equivalent amino acid sequence. It reads data in from the DATA filehandle, which, if you've not com across it before, is automagically opened for reading by your script when it runs. If you put any data at the end of your script after a __DATA__ token, the DATA filehandle has access to it without even going to the trouble of opening a filehandle yourself. The __DATA__ section is useful for containing configurations and default data for a script. Anyway, here's the RNA parser in all its simplicity:

#!/usr/bin/perl
use strict;
use Parse::RecDescent;
$::RD_ERRORS = 1; # kill parser if it encounters an error
$::RD_WARN = 1; # enable warnings
$::RD_HINT = 1; # helpful hints
my $grammar = << 'GRAMMAR';
       seq: start codon(s) stop /^\Z/
       start: met 
       stop: amber | ochre | opal
       codon: leu | phe | cys | trp | tyr | val | met |
              gly | ala | glu | asp | pro | arg | lys |
              his | gln | ser | thr | ile | asn
        # Amino acid codons
        lys: A A pur  { print "$item[0] " }
        asn: A A pyr  { print "$item[0] " }
        ile: A U pyr  { print "$item[0] " }
        thr: A C base { print "$item[0] " }
        met: A U G    { print "$item[0] " }
        ser: A G pyr  { print "$item[0] " }
        gln: C A pur  { print "$item[0] " }
        his: C A pyr  { print "$item[0] " }
        arg: C G base | A G pur { print "$item[0] " }
        pro: C C base { print "$item[0] " }
        asp: G A pyr  { print "$item[0] " }
        glu: G A pur  { print "$item[0] " }
        ala: G C base { print "$item[0] " }
        gly: G G base { print "$item[0] " }
        val: G U base { print "$item[0] " }
        tyr: U A pyr  { print "$item[0] " }
        trp: U G G    { print "$item[0] " }
        cys: U G pyr  { print "$item[0] " }
        phe: U U pyr  { print "$item[0] " }
        leu: U U pur | C U base {  print "$item[0] " }
        # Stop codons
        amber: U A G  { print "$item[0] " }
        ochre: U A A  { print "$item[0] " }
        opal:  U G A  { print "$item[0] " }
        base: pyr | pur
        pyr: U | C
        pur: G | A
        U: /U/i
        A: /A/i
        G: /G/i
        C: /C/i
GRAMMAR
my $p = new Parse::RecDescent( $grammar ) or die "Compile error\n";
while ( <DATA> )
{
    print;
    chomp;
    unless ( defined $p->seq( $_ ) )
    {
        print "\nOoops\n";
    }
}
__DATA__
AUG AAA GGC AGA AAG UAA

And for my next trick, a perl parser written in Parse::RecDescent. Hmmm, perhaps not. You might like to look at my calculator script to see a rather more complex example though. Improvements and patches are welcome ☺

Next…