Reinventing the wheel
The parsing of mark-up languages like XML and HTML is fairly well served by CPAN modules. However, for parsing a hand rolled data structure, you will have to roll your own parser in some way. For this, you may consider using regular expressions, and this will generally work out fine as long as you know what you're doing, and if you are sure that the thing you hope to parse is actually parsable with regexes. For not all things are. Although Perl's regexes are too irregular (in the mathematical sense) to be regarded as real 'regular' expressions, they are still too regular to be used to create (irregular) grammars. Irregular grammars are needed to describe nested languages like XML, hence any attempt to write a regex that matches any conceivable XML document is doomed to fail. For the same reasons, using regexes to parse HTML, or (gulp) Perl code is at best a dirty hack for a few special cases, and at worst, impossible. So don't do that.
Grammars, of which regexes are the most basic sort, are what you use
to describe, parse, and grasp the meaning of a language. The application
of a grammar to a string will allow you to determine whether that string
belongs to a given language or not. For regexes, what
/^aa*$/ actually does is test whether $_
belongs to the language described by the grammar aa*, which
is a rather simple language consisting of the 'words' a, aa, aaa, aaaa,
etc. More complicated languages need more complicated grammars,
and certain complexities in a language require qualitatively
more complex grammars to parse them: nesting is just such a
complexity.
There are several irregular grammar modules on CPAN (of which
HTML::Parser and XML::Parser you've already
met), but the one we will be looking at is Parse::RecDescent
(parse-recursive-descent). However, before we delve into using the
module, it's worth a relatively brief computer science diversion to
understand what we mean by regexes, grammars and where their limitations
stem from.
Grammars describe whether utterances, such as a DNA sequence of
aagttc, belong to a given language, such as the genetic
code. Not only do increasingly complex languages require increasingly
complex grammars, they also require an increasingly complex machine
(i.e. parser program) to make the judgement of whether a string
is derivable using the grammar. All grammars consist of four things
(concentrate, here comes the science bit, *shudder*):
- A finite set of terminal symbols, Σ,
(i.e. an alphabet). Terminal symbols are usually written with
small letters. When we try to boil an utterance down, using a top-down
parser like
Parse::RecDescent, to see if it conforms to a grammar (i.e. matches), we are trying to reduce the utterance to a list of terminal symbols by the use of the rules in the grammar. If this succeeds, the utterance is a valid part of the language described by the grammar. In perl regexes, you have ASCII or Unicode from which to draw your terminal symbols. For DNA, the alphabet consists of the four bases, hence for a genetic code parser,Σ = {a,t,g,c}. - A finite set of non-terminal sequences, N. In
boiling down an utterance we will proceed through several intermediate
stages. For example, in determining whether "aagttc" conforms to DNA
grammar, we may construct non-terminal sequences such as C (a codon,
which is a three letter DNA 'word'), knowing that C itself consists of
three bases, like
agt, and S, a complete DNA sequence consisting of many codons. Non-terminals are usually written with capital letters, so for our DNA parser,N = { C, S, ... }. - A set of production rules, P. Production rules
specify what the non-terminals can be made of. Since a codon C can
consist of any three-letter-combination of the four bases, we have the
production rules:
C → aaa C → aat C → aag etc.
- A start rule, S. A DNA sequence consists of a codon
followed by some more DNA sequence, so our start rule looks like the
self-referential:
S → C S
This will all become immediately relevant when we start using
Parse::RecDescent, since the specification of a
Parse::RecDescent grammar is just a long list of production
rules and symbols. The next science bit is just for 'fun', but well worth
remembering. The commonest way of classifying grammars is the four level
Chomsky hierarchy. Each of these levels is defined by how many
restrictions we put on what the production rules may look like:
- Regular languages. Regular languages are the most restricted: their
grammar's production rules must conform to:
A → a (non-terminal leads to a terminal symbol) A → aB (non-terminal leads to a terminal followed by a single non-terminal) A → ε (non-terminal leads to 'nothing', i.e. the null string, ε)
Regular languages can be described by the grammar of regular expressions, and can be parsed by a finite automaton (check out the link: it tells you how Perl's regexes actually work inside, which is well worth knowing). To show that a given string is a member of a particular regular language, all an automaton needs to do is show that it is either a letter from the alphabet/a/, a concatenation of letters from the alphabet/ag/, an alternation of letters from the alphabet/a|g/, a recursion of letters from the alphabet/a*/, or some (possibly nested) combination of these basic rules/(a|ga*)|g*t/. Everything else is just syntactic sugar for these basic operations:/a+/is/aa*/,/[agtc]/is/(a|t|g|c)/,/a/iis/A|a/and so on. All a finite automaton can tell you is whether a string matches by these criteria: it has no memory of what it matched. By this definition, Perl's 'regular expressions' are not regular, since Perl's regexes can also say "YES, and here's what matched:$1, $2, $3"; furthermore, real regexes can't have backreferences like\1and\2for the same reasons. Perl has some highly irregular regular expressions. The next step up from regular languages are: - Context free languages. Context-free languages have less restricted
production rules than regexes: their grammar's production rules must
conform to:
A → w
where w is any combination of terminal symbols and non-terminal symbols, e.g.S→augCS. To parse a context free language, you need a push-down automaton (which can remember certain things, unlike a finite state automaton). The great advantage of context free languages is that they can deal with nesting properly. Most computer languages are parsable using a context-free grammar (CFG): the commonest tools for writing parsers for computer languages arelexandyacc, which respectively create a stream of lexical tokens ( i.e. terminal symbols), and then see if they conform to the grammar of the language. ( N.B.lex/yaccis bottom-up: it sees if the tokens can be built up to a grammatical utterance, rather than seeing if an utterance can be broken down to legal tokens). One problem of Perl 5 regexes is that although they are not very regular, they are still not capable of matching balanced pairs: no Perl 5 regex can determine whether a nested construct like ( 2 + ( 2 + ( 5 / 6 ) + ( 9 - ( 8 * 6 ) ) ) ) is correctly parenthesised or not. Only a CFG can do this (by recursion):EXPRESSION → EXPRESSION EXPRESSION → ( EXPRESSION ) EXPRESSION → ( EXPRESSION OPERATOR EXPRESSION )
Regexes are a subset of CFG, hence you can use regexes in CFG production rules with abandon, since they are rather more concise than (but equivalent to) great lists of production rules:D → C+ C → [agtc]{3}Perl 6's 'regexes' will in fact be context free grammars (or higher), and the Perl 6 parser will be self-hosting, i.e. written in Perl 6 regexes (or will at least give that impression). This means that the Perl 6 regex engine will be powerful enough to parse Perl itself, which is a lovely bit of bootstrapping and a Good Thing: CFGs are more powerful than regexes, and they will allow you to mess with how perl parses itself from within Perl. Hahaha! One step above context free languages are: - Context sensitive languages, which are even less restricted than
CFGs. Their productions are of the form:
α A β → α γ β
where α, β and γ are any mixture of terminal or non-terminal symbols, and γ is compulsory. Only a linear-bounded nondeterministic Turing machine - no, I'm not completely sure what one is either: something to do with short lengths of paper tape, I believe ☺ - can judge whether a string is a member of a context-sensitive language. - Recursively enumerable languages. Only a Turing machine (e.g. your brain) have any hope of parsing these: they have no restrictions on the form of their production rules, and only human languages are silly enough to dabble with them. We'll have none of them here, thank you very much.
Parse::RecDescent
is a pure Perl top-down parser that can parse CF languages, and certain
CS languages. It's well worth getting your head round it, as Perl 6's
regexes bear a startling similarity, and learning
Parse::RecDescent will help you beat the rush to become the
Perl 6 regex office guru before Perl 6 is even out. Sticking with my
usual biology obsessions, we will write a parser to parse and manipulate
a DNA sequence. This is actually overkill (as you could probably do it
with a regex), but looks nicer.
Parse::RecDescent is actually very simple to use: it
simply takes two strings, one a grammar for a language, the other a
string to parse and pronounce judgement on. The bones of it look
something like:
#!/usr/bin/perl
use strict;
use Parse::RecDescent;
my $grammar = << 'GRAMMAR';
startrule: expression /^\Z/
# you can also put comments into the grammar
# such as to explain that /^\Z/ means 'match the end of the string'
expression:
number |
'(' expression ')' |
'(' expression operator expression ')'
number: /\d+/
operator: /[%*\/+-]/
GRAMMAR
my $p = new Parse::RecDescent( $grammar ) or die "Compile error\n";
while ( 1 )
{
chomp( $_ = <STDIN> );
print "Syntax error\n" unless defined $p->startrule( $_ );
}
This grammar (in bold) will only accept numbers,
9 , or correctly nested parentheticals, (9+0),
((5+9)/3). These can be whitespace-padded (the default
'skip' in Parse::RecDescent is /\s*/, although
you can modify this manually using a parser
directive). The /^\Z/ in the startrule
ensures we do not match strings that merely contain correct
substrings: it's much like /^a$/ anchoring
a regex. We have defined four productions in the grammar (which is just a
single-quoted heredoc), startrule, expression,
number and operator. Each production has the
form:
rule: description
where the description is simply what the
rule can be composed of: alternatives are denoted with a
vertical bar | as in regexes, and can extend over several
lines. The parser will attempt to match the first item in the list in
every which way it can, before moving onto the next item (in contrast to
many parsers, which look for the longest match). The items in the list
can be either literal strings like '(' or regexes like
/\d+/ or subrules.
rule:
'literal string, i.e. terminal symbol'
|
m{regex, i\.e\. a set of terminal symbols}x
|
subrule_ie_another_nonterminal_symbol
A description can even contain itself:
expression: number | '(' expression ')'
However, it should be noted that:
expression: expression | number
is illegal, since a description cannot contain the rule it describes as its leftmost item: this is because most such grammars are non-terminating (i.e. go into infinite recursion). To enable warnings about problems like this, you can turn on a variety of warnings by putting the following at the top of your script:
$::RD_ERRORS = 1; # kill parser if it encounters an error $::RD_HINT = 1; # helpful hints
(note that $::RD_ERRORS is short for the package-explicit
$main::RD_ERRORS, which itself is long for our
$RD_ERRORS. TIMTOWTDI, and there's more than one way to keep
strict happy). Subrules can be quantified much like atoms in
regexes: the equivalent of /a+/ (one-or-more) is:
rule: subrule(s)
The equivalent of /a?/ is helpfully just:
spouse: husband(?) | wife(?)
and the equivalent of /a*/ is:
name: forename middlename(s?) surname
The actually parsing takes place after the definition of the grammar:
my $p = new Parse::RecDescent( $grammar ) or die "Compile error\n";
while ( 1 )
{
chomp( $_ = <STDIN> );
print "Syntax error\n" unless defined $p->startrule( $_ );
}
This creates a new parser object $p containing a compiled
form of the grammar. To check that a string matches a particular rule in
the grammar is simply a matter of calling the rule as a method on the
parser object. I have helpfully named our start-rule (S)
startrule, but this is not compulsory. If the string matches
the grammar, the parser will return something, if it fails, it returns
undef. This is important: you must use:
if ( defined $p->rule( $string ) ){ print 'OK' }
to check if $string parses correctly.
Parsing a string is well and good, but all we have currently done is determine whether a given string is part of the language defined by the grammar. Usually, we also want to understand the meaning of the string too, i.e. do something based on its contents. To do this, we can attach actions to the productions. In the simplest case, an action looks like this:
use Parse::RecDescent;
my $grammar = q( start: /\w+/ { print "Found words !\n" } );
my $p = new Parse::RecDescent( $grammar ) or die "Compile error\n";
while ( 1 )
{
chomp( $_ = <STDIN> );
print "Syntax error\n" unless defined $p->start( $_ );
}
This attaches the action { print "Found words!\n" } to
the regex /\w+/. The code can be any bit of Perl you like,
and it will be executed if the previous item succeeded. The action
executes within the namespace of the parser, so if you plan on
manipulating variables in the body of your program from within an action,
you must qualify them with a namespace, and share them as
package globals with our (not with
my):
use strict;
use Parse::RecDescent;
our $matched;
my $grammar = q( start: /\w+/ { $::matched = "It worked" } );
my $p = new Parse::RecDescent( $grammar ) or die "Compile error\n";
while ( 1 )
{
chomp( $_ = <STDIN> );
if ( defined $p->start( $_ ) )
{ print "<<< $matched >>>\n"; }
else{ print "Syntax error\n" }
}
Within the parser namespace, you will find a number of useful
variables that you can access within an action. The most useful are
@item (and %item) and $return.
@item contains the return values of the items in the
production:
seq: start codon(s) stop { print "@item\n"; }
seq 1 ARRAY(0x23992cc) 1
$item[0] contains the name of the rule
(seq), and the slice @item[ 1 .. $#item ]
contains the the values associated with start ,
codon(s) and stop. This is actually rather
fragile (since if you change your code, it may change the position of the
item in the list). %item allows access to named subrules:
here $item{start} is equivalent to $item[1],
but is less likely to break. You can't get named access to un-named items
(such as regexes and literals), so it's generally a good idea to create a
non-terminal rule for each terminal symbol:
rule: /\d{2}/ subrule { print $item[1] }
is probably better written:
rule : numberregex subrule {print $item{numberregex} }
numberregex :/\d{2}/
So you can access the regex by name rather than by position.
The value of the things available in @item and
%item are what was actually matched: by default, either the
value of a literal string, the thing matched by a regex, or the last item
matched by a subrule. However, you can readily subvert this by using an
action to return something different. The simplest and most explicit way
of doing this is to set the value of $return in an
action:
#!/usr/bin/perl
use strict;
use Parse::RecDescent;
my $grammar = <<'GRAMMAR';
start: subrule { print "$item{subrule}\n" }
subrule: name number
{ $return = "name was $item{name} and number was $item{number}" }
name: /[A-Z]+/i { $return = $item[1] }
number: /\d+/ { $return = $item[1] }
GRAMMAR
my $p = new Parse::RecDescent( $grammar ) or die "Compile error\n";
while ( 1 )
{
chomp( $_ = <STDIN> );
unless ( defined $p->start( $_ ) ){print "Syntax error\n";}
}
Steve 666
name was Steve and number was 666
This allows you to pass items
up the rule hierarchy. By no means the last feature of
Parse::RecDescent, but the last I'll explain here is the use
of directives. Directives, which all look something like
<directive> influence how the parser runs. Four
directives that are frequently useful are <skip>,
<leftop>, <commit> and
<uncommit>. <skip> allows you to
change what the parser should ignore before a token:
rule: command <skip:'[\s,]*'> list
This will allow the parser to ignore any whitespace or commas in the
subrule list (and all its subrules).
The <leftop> directive (and the similar
<rightop> directive) allow you to create quick
left-associative binary operators and lists. For example, to match a list
like:
( Gnetum, Welwitschia, Ephedra )
you could write a rule that looks like
list : '(' /\w+/ (',' /\w+/ )(s?) ')'
or even (don't even think about it):
list :
'(' /\w+/ ')'
| '(' /\w+/ ',' /\w+/ ')'
| '(' /\w+/ ',' /\w+/ ',' /\w+/ )'
| ...
but it's much easier to write
list : '(' <leftop: /\w+/ ',' /\w+/> ')' { print "@{ $item[2] }\n" }
and let the module work it out for you. The
<leftop> directive takes three arguments: a left
operand, an operator (here the comma) and a right operand, and expands
them suitably so it will match a list (or similar) of any length. It
returns an anonymous array of the operands matched (here
$item[2] ).
The <commit> directive allows you to short-cut
through long lists of alternatives for efficiency, by commiting to a
particular alternative in a list of productions:
rule: 'hello' <commit> name | 'goodbye' name | 'goodnight' name
This means that if the rule matches the string 'hello',
but then fails to match name, the entire rule
immediately fails, since there is no point in checking the other
productions as they clearly will not match either. The
<uncommit> directive allows you to change your mind
later on (this example pilfered directly from the POD for the
module):
if_statement: 'if' <commit> condition 'then' block <uncommit> 'else' block | 'if' <commit> condition 'then' block
Here, we commit to matching the entirety of the first
production after an 'if'. This is efficient, since we won't
match the whole if_statement rule unless we match the
condition 'then' block subrules too. However, we should then
uncommit ourselves, in case there's no 'else' block: if we
don't, the whole rule will fail without trying out the second
production.
OK. Here is a simple parser script that
checks to see if an RNA sequence of A(denine), G(uanine), C(ytosine) and
U(racil: RNA has this instead of the Thymine found in DNA) bases fits a
canonical mRNA translatable sequence of a start-codon (three bases
spelling AUG), an integer number of codons, followed by a stop-codon
(which have bizarre names for historical reasons). Whilst doing this, it
helpfully translates the RNA sequence into the equivalent amino acid
sequence. It reads data in from the DATA filehandle, which,
if you've not com across it before, is automagically opened for reading
by your script when it runs. If you put any data at the end of your
script after a __DATA__ token, the DATA
filehandle has access to it without even going to the trouble of opening
a filehandle yourself. The __DATA__ section is useful for
containing configurations and default data for a script. Anyway, here's
the RNA parser in all its simplicity:
#!/usr/bin/perl
use strict;
use Parse::RecDescent;
$::RD_ERRORS = 1; # kill parser if it encounters an error
$::RD_WARN = 1; # enable warnings
$::RD_HINT = 1; # helpful hints
my $grammar = << 'GRAMMAR';
seq: start codon(s) stop /^\Z/
start: met
stop: amber | ochre | opal
codon: leu | phe | cys | trp | tyr | val | met |
gly | ala | glu | asp | pro | arg | lys |
his | gln | ser | thr | ile | asn
# Amino acid codons
lys: A A pur { print "$item[0] " }
asn: A A pyr { print "$item[0] " }
ile: A U pyr { print "$item[0] " }
thr: A C base { print "$item[0] " }
met: A U G { print "$item[0] " }
ser: A G pyr { print "$item[0] " }
gln: C A pur { print "$item[0] " }
his: C A pyr { print "$item[0] " }
arg: C G base | A G pur { print "$item[0] " }
pro: C C base { print "$item[0] " }
asp: G A pyr { print "$item[0] " }
glu: G A pur { print "$item[0] " }
ala: G C base { print "$item[0] " }
gly: G G base { print "$item[0] " }
val: G U base { print "$item[0] " }
tyr: U A pyr { print "$item[0] " }
trp: U G G { print "$item[0] " }
cys: U G pyr { print "$item[0] " }
phe: U U pyr { print "$item[0] " }
leu: U U pur | C U base { print "$item[0] " }
# Stop codons
amber: U A G { print "$item[0] " }
ochre: U A A { print "$item[0] " }
opal: U G A { print "$item[0] " }
base: pyr | pur
pyr: U | C
pur: G | A
U: /U/i
A: /A/i
G: /G/i
C: /C/i
GRAMMAR
my $p = new Parse::RecDescent( $grammar ) or die "Compile error\n";
while ( <DATA> )
{
print;
chomp;
unless ( defined $p->seq( $_ ) )
{
print "\nOoops\n";
}
}
__DATA__
AUG AAA GGC AGA AAG UAA
And for my next trick, a perl parser written in
Parse::RecDescent. Hmmm, perhaps not. You might like to look
at my calculator script to see a
rather more complex example though. Improvements and patches are welcome
☺
