Lesson 17

Universal glue

One of the things that Perl is useful for is providing the glue to allow disparate bits of software to interact smoothly. One of the most important parts of this usefulness is the number of modules available on CPAN for parsing various file formats, so as to dynamically convert from one representation of data (that program X understands) to another (that program Y understands). We have already looked at one of these parsers, HTML::Parser. The next two lessons of the tutorial will deal with two of Perl's other parser modules: XML::Parser, which is for parsing XML (well, durr), and Parse::RecDescent, which is for constructing generalised grammars, allowing you to parse such things as programming languages.

What we'll do here is use XML::Parser to convert an XML document containing a biological hierarchy (i.e. a list of taxonomic groups from kingdom Animalia down to species Homo sapiens or similar) into an HTML reference guide to Life. As with the SQL "tutorial", this isn't really the place to talk about XML, but briefly, a well formed XML document should:

A valid XML document also has a DTD (document type description) indicating what elements may contain in terms of data, and other elements. Here is the well formed (but invalid, i.e. DTD-less) XML document we'll be playing with, and here are the three HTML document we wish to create from them: Archaea, Bacteria and Eukarya. I'd have a look at them if I were you or you may get a little lost in the biology bit of this lesson ;)

XML::Parser is built on top of the Expat library, which is an event driven XML parser. However, it is capable of outputting a variety of constructs, or styles. Since an XML document is basically a large data tree, one of these styles is unsurprisingly called Tree, and when the XML is parsed, it returns an arrayref containing the entire document translated into a perl Thingy (i.e. a glutinous mass of array/hashrefs), which you can explore using standard built-ins like for and each. The Object style is similar but package/object oriented. Alternatively, we can use the Subs or Stream style and register various handlers for events such as entering a tag, leaving a tag, finding a comment, and finding text, much like we did for HTML::Parser. There are also many modules built on top of XML::Parser providing standard XML APIs like SAX and DOM. However, we will be using simple old XML::Parser with the Tree style in this program, since our data is essentially just a tree, and funnily enough this is by far the easiest way to manipulate our data. (I spent rather a lot of time tearing my hair out trying to get a Stream version to work, which will teach me to do the obvious new thing, rather than the familiar but unworkable thing).

The HTML documents we want to produce will have separate pages for each of Life's domains, of which there are three, Archaea, Bacteria and Eukarya: the first are those weird-arse 'bacteria' that live in boiling sulfuric acid, the last are all organisms except the bacteria, i.e. everything from amoebae to fig wasps). These HTML documents should contain a pretty printed list of taxon names and notes for each lower level of classification, neatly indented to indicate the level of the taxon. For your information, the principal biological taxa are called (in descending order of importance) domains, empires, kingdoms, phyla or divisions (it's a zoology/botany feud as to what this level should be called), classes, cohorts (optional), orders, families, tribes (optional), genera, species, varieties and forms (latter two optional). You can also have super-, sub- and infra- (i.e. sub-sub-) taxa: a superclass contains a few classes, but is lower in the hierarchy than phylum. This is all rather confusing, so I leave writing the DTD as an exercise for the interested reader. We will also allow text sections to contain embedded <latin> tags, to mark out Latin names (which are traditionally formatted in italics).

The XML document we will be parsing is here. It basically just consists of a hierarchy of tags from <life>, through <empire>, <kingdom>, <subkingdom>, etc., down to <class> (and presumably <species> through to <form> if I hadn't got bored already). Each of these tags has an attribute giving the name of the group, and each <taxon name="blah"> </taxon> tag pair surrounds some interesting information about the taxon. Since some of the taxa in biological systematics are in a state of flux, the taxon tags can also have an optional attribute (status) indicating this. Here is a typical node:

<subkingdom name="Radiata" status="May be paraphyletic, often referred to as the coelenterates.">
The radially symmetrical animals. 
All other animals are (more or less) bilaterally symmetrical.
...some child (phylum nodes) ...
</subkingdom>

The first thing to to is code up a bare bones parser to handle the XML stream:

#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;
my $p = XML::Parser->new( Style => "Tree" );
my $input = "life.xml";

$p is our parser, and when we instantiate it, we do so using the promised Tree style. We also specify an input file. The output files will be dumped into the CWD. Then we set the parser off running. The parsefile function returns the XML parse-tree, which we then traverse to generate our HTML.

traverse( $p->parsefile( $input ) );
exit(0);

That, as usual, was the easy bit. Now we need to code up the parser's traverse function. This is surprisingly easy once we understand what the XML, parse-tree and HTML look like. Here are some (simplified from the sources) examples:

XML

<domain name="Archaea" status="Previously the Archaebacteria.">
  The Archaea are unusual prokaryotes, with a morphology similar to the 
  Eubacteria, but in biochemistry and molecular biology, they more closely 
  resemble the eukaryotes. Probably more closely related to the eukaryotes 
  than to the eubacteria.
  <kingdom name="Crenarchaeota">
    Extremely heat and/or acid tolerant bacteria.
  </kingdom>
  <kingdom name="Euryarchaeota">
    Extremely salt tolerant bacteria and the methanogenic bacteria that 
    make you fart.
    <division name="Methanogena" status="Functional, not evolutionary grouping.">
      Methane producing archaea.
    </division>
    <division name="Halophila" status="Functional, not evolutionary grouping.">
      Extreme salt tolerant archaea.
    </division>
    <division name="Thermacidophila" status="Functional, not evolutionary grouping.">
      Extremophiles capable of growing at pH 2 and in boiling water.
    </division>
  </kingdom>
  <kingdom name="Korarchaeota">
    Archaea only known from their DNA sequences: not yet cultured in the
    laboratory.
  </kingdom>
  <kingdom name="Nanoarchaeota">
    Endosymbiotic bacterium (<latin>Nanoarchaeum equitans</latin>), 
    tentatively assigned to its own kingdom.
  </kingdom>
</domain>

Parse tree

For the sake of convenience, I have removed some newlines and tabs from before, after and within the 'text' items (those preceded by a '0'. We will still need our parser to tidy these up. Note the empty text nodes preceeding most of the mark-up elements like kingdom and domain: these contain the indentation whitespace from the original XML.

$parse_tree = 
[
  'domain',
  [
    {
      'status' => 'Previously the Archaebacteria.',
      'name' => 'Archaea'
    },
    0,  
    'The Archaea are unusual prokaryotes, with a morphology similar to the 
    Eubacteria, but in biochemistry and molecular biology, they more closely resemble the eukaryotes. 
    Probably more closely related to the eukaryotes than to the eubacteria.',
    'kingdom',
    [
      {
        'name' => 'Crenarchaeota'
      },
      0,
      'Extremely heat and/or acid tolerant bacteria.'
    ],
    0,
    '',
    'kingdom',
    [
      {
        'name' => 'Euryarchaeota'
      },
      0,
      'Extremely salt tolerant bacteria and the methanogenic bacteria that make you fart.',
      'division',
      [
        {
          'status' => 'Functional, not evolutionary grouping.',
          'name' => 'Methanogena'
        },
        0,
        'Methane producing archaea.'
      ],
      0,
      '',
      'division',
      [
        {
          'status' => 'Functional, not evolutionary grouping.',
          'name' => 'Halophila'
        },
        0,
        'Extreme salt tolerant archaea.'
      ],
      0,
      '
                        ',
      'division',
      [
        {
          'status' => 'Functional, not evolutionary grouping.',
          'name' => 'Thermacidophila'
        },
        0,
        'Extremophiles capable of growing at pH 2 and in boiling water.'
      ],
      0,
      '
                '
    ],
    0,
    '',
    'kingdom',
    [
      {
        'name' => 'Korarchaeota'
      },
      0,
      'Archaea only known from their DNA sequences: not yet cultured in the laboratory.'
    ],
    0,
    '',
    'kingdom',
    [
      {
        'name' => 'Nanoarchaeota'
      },
      0,
      'Endosymbiotic bacterium (',
      'latin',
      [
        {},
        0,
        'Nanoarchaeum equitans'
      ],
      0,
      '), tentatively assigned to its own kingdom.'
    ],
    0,
    ''
  ]
];

HTML

Missing boilerplate headers: this is just the stuff in the body:

<p><span class="domain">Domain Archaea</span></p>
<p class="status">Previously the Archaebacteria.</p>
<p>The Archaea are unusual prokaryotes, with a morphology similar to the Eubacteria, but in biochemistry and molecular biology, they more closely resemble the eukaryotes. Probably more closely related to the eukaryotes than to the eubacteria.</p>
<ul>
  <li><span class="kingdom">Kingdom Crenarchaeota</span>
    <p>Extremely heat and/or acid tolerant bacteria.</p>
  </li>
  <li><span class="kingdom">Kingdom Euryarchaeota</span>
    <p>Extremely salt tolerant bacteria and the methanogenic bacteria that make you fart.</p>
    <ul>
    <li><span class="division">Division Methanogena</span>
      <p class="status">Functional, not evolutionary grouping.</p>
      <p>Methane producing archaea.</p>
    </li>
    <li><span class="division">Division Halophila</span>
      <p class="status">Functional, not evolutionary grouping.</p>
      <p>Extreme salt tolerant archaea.</p>
    </li>
    <li><span class="division">Division Thermacidophila</span>
      <p class="status">Functional, not evolutionary grouping.</p>
      <p>Extremophiles capable of growing at pH 2 and in boiling water.</p>
    </li>
    </ul>
  </li>
  <li><span class="kingdom">Kingdom Korarchaeota</span>
    <p>Archaea only known from their DNA sequences: not yet cultured in the laboratory.</p>
  </li>
  <li><span class="kingdom">Kingdom Nanoarchaeota</span>
    <p>Endosymbiotic bacterium (<em>Nanoarchaeum equitans</em>), tentatively assigned to its own kingdom.</p>
  </li>
</ul>

XML::Parser generates a parse-tree from the XML source. If the XML is well-formed, this parse-tree will be a two-element arrayref. If it isn't the parser will bring the whole program down in flames, hence, wrap the parse in a block eval if you're worried:

eval { $p->parsefile( $input )};
if ( $@ ) { print "Oops, $input is malformed: $@\n";

If the XML is well-formed, the zeroeth element of the parse-tree is the root element of the XML (i.e. <life>), and the first element is another arrayref, containing the children of the root element (i.e. the rest of the document). In fact, every node of the parse-tree looks something like:

$node = 
    [ 
       "element1", 
           [ node_of_element_1's_children ... ], 
       "element2",
           [ node_of_element_2's_children ... ],
        ...
    ];

with the parse-tree itself being a special case having (by definition) just two array elements (since a valid XML document has only one root).

The attributes of the element and plain text are also recognisable in the parse-tree. An element's attributes are found in a hashref as the zeroeth element of its own child node:

$node =
[
    "domain",
    [
        { name => "Archaea" }, 
                   # attributes of domain are zeroeth child element
        "kingdom",
                   # kingdom is a child element of domain
        [ ... ],   
                   # and has its own node of child elements
        "kingdom",
        [ ... ],
    ]
];

Plain text is found as a child node with the special name '0':

$node =
[
    "domain",
    [
        { name => "Archaea" }, 
            # attributes of domain
        '0', 
            # Zero as an element name indicates plain text
        "Plain text is here",
            # instead of another arrayref node
        "kingdom",
        [ ... ],
    ]
];

This seems rather complicated and self-referential, but an easy way of dealing with a recursive structure is to write a recursive function (i.e. one that calls itself), and it's not as difficult as it sounds:

sub traverse
{
    my $node = shift;
    while ( defined ( my ( $element = shift @{ $node } ) 
    {
        my $child = shift @{ $node };
        if ( ref $child ) 
        {
            # if $child is yet another arrayref node
            my %attr = %{ shift @{ $child } };
                # yank the attributes out of the grandchild
            print "$element has attributes @{[ %attr ]}\n";
            traverse( $child );
                # child element is now just a list of element/children pairs
        }
        else 
        {
            # otherwise, if $child is just some text
            print "$child\n";
        }
    }
}

This is a basic traversal routine that will take a node and shift the elements out of it two at a time. The first of these is the element's name (e.g. domain), the second is the child node. It then checks to see if the child node is a reference. If the child is a reference, this means it is yet another node, and hence needs traversing itself. So we shift the attributes out of the grandchild node, print out some data about it and then recursively call traverse to deal with the child node. However, if the child node is not a reference, it must be plain text of some sort, which we print out as it is. To manipulate the data, all we need to do is add some specific, element-sensitive code to these two possibilities.

In the following traversal routine, we ignore life elements, create new files when we come across domain elements, italicise text when we come across latin elements, and cache our data in an nested HTML list-of-lists-of-lists-etc., otherwise. The recursive nature of the traversal neatly ensures the correct nesting of <ul>, </ul>, <li> and </li> tags in the HTML.

Note that the <latin> nodes need a little more consideration: if we just parsed them using the main traverse routine, we'd end up with all the Latin names as nested bulleted lists under their parent description text. Since these <latin> nodes are span style HTML elements, rather than div style elements, we need a separate text-parsing routine.

#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;
my $p = XML::Parser->new( Style => "Tree" );
my $input = "life.xml";
my $depth = 0;  # For pretty indentation of HTML.
my $cache = ''; # Cached formatted text for file output.
traverse( $p->parsefile( $input ) );
exit(0); 

The $depth will be incremented each time we enter the traverse routine, and decremented when we leave: this will allow us to keep track of how deep down the XML tree we are, and allow us to indent the resultant HTML prettily. Yes, this is just window-dressing. $cache will be used to store up parsed and formatted HTML output. Although this could get memory-heavy if the XML file is large, to make the output as pretty as possible, we need to be able to check whether our nodes have any children, but we can't know that within a naïve recursive solution. If we were less bothered about prettiness, we could just print the formatted text out as we went along.

sub traverse
{
    # This subroutine is called recursively to convert an XML list-of-lists-etc
    # into an HTML directory-of-files-of-lists-of-lists-etc.
    $depth++;
    my $node = shift;
    while ( my ( $element, $child ) = splice @{ $node }, 0, 2 )
    {
        if ( ref $child )
        {
            # This deals with mark-up nodes.
            my %attr = %{ shift @{ $child } };
            # For the sake of consistency, we'll put all the data associated 
            # with this node into %attr.
            $attr{element} = $element;
            $attr{taxon}   = ucfirst $element;
            $attr{name}    = ucfirst $attr{name};
            ( $attr{font}  = $element ) =~ s/^(infra|sub|super)//;
                        $attr{font}    = 'division' if $attr{font} eq 'phylum';
            if ( $attr{element} eq "domain" )
            {
                # Each domain gets its own HTML file.
                $depth = 0;  # Reset depth. 
                $cache = ''; # Empty cache.
                # Descriptions may contain <latin>, so we need to parse these
                # appropriately.
                $attr{description} = get_text( $child ); 
                open my $OUTPUT, ">", "$attr{name}.html" 
                    or die "Can't open $attr{name}.html for writing: $!\n";
                # Boilerplate...
                print $OUTPUT <<"THIS";
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta name="generator" content="xmlparser.pl" />
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
  <title>$attr{name}</title>
  <meta name="author" content="Steve Cook" />
  <meta name="copyright" content="Copyright © 2005 Steve Cook. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation." />
  <meta http-equiv="content-language" content="en" />
  <meta name="keywords" content="$attr{name}" />
  <meta name="description" content="The domain $attr{name}" />
  <style type="text/css" media="screen">
  <!--
    body          { font-size: 10pt; }
    span.domain   { font-size: 30pt; }
    span.empire   { font-size: 26pt; }
    span.kingdom  { font-size: 22pt; }
    span.division { font-size: 18pt; }
    span.class    { font-size: 14pt; }
    p.status      { font-style: italic; }
  -->
  </style>
</head>
<body>
<p><span class="domain">$attr{taxon} $attr{name}</span></p>
<p class="status">$attr{status}</p>
<p>$attr{description}</p>
<ul>
THIS
                # Now we traverse the child node, which will contain
                # <kingdom>s, text, etc., gathering the parsed text in the 
                # cache. We could print this immediately, but see below 
                # for why we don't.
                traverse( $child );
                print $OUTPUT $cache;
                # Boilerplate...
                print $OUTPUT <<"THIS";
</ul>
</body>
</html>
THIS
                close $OUTPUT;
            }            

This section of the code deals with the topmost node type, domain. Most of the code above is just HTML boilerplate (and could better be dumped into an HTML::Template). We print out the details we crib from the domain node's attributes. We allow a recursive call to traverse to parse the child node of the domain into the $cache, which we then print, followed by a footer.

            else
            {
                # This deals with all nodes below <domain>, which need to be
                # put in HTML lists-of-lists-of-lists-etc.
                my $padding = '  ' x $depth; # HTML pretty printing.
                $attr{description} = get_text( $child );
                # Again, descriptions may contain <latin>, so we need to parse
                # these appropriately.
                $cache  .= "$padding<li><span class=\"$attr{font}\">$attr{taxon} $attr{name}</span>\n";
                $cache  .= "$padding  <p class=\"status\">$attr{status}</p>\n" if $attr{status};
                $cache  .= "$padding  <p>$attr{description}</p>\n";
                $cache  .= "$padding  <ul>\n";
                traverse( $child );
                $cache  .= "$padding  </ul>\n";
                $cache  =~ s/$padding  <ul>\n$padding  <\/ul>\n$//;
                    # If the traversal added no data, we delete the empty list.
                    # There are probably less hack-ish ways of doing this. Note
                    # that this is why we don't just print as we go along.
                $cache  .= "$padding</li>\n";
            }        
        }
        else
        {
            # This deals with text nodes.
            $cache .= get_text( $child );
        }
    }
    $depth--;
}

The remaining taxonomic elements just need to be dumped into ever more deeply nested HTML lists. The problem here is that until we call traverse, we can't know whether the node we are currently dealing with has any children that need nesting (<ul>...blah...</ul>) in this way. Although we could use some clever look-ahead, it's easier to just do the recursion, and clip off any empty child lists after the event.

If the element is not taxonomic, we call get_text (which we also use to parse the text for the description). This accumulates text into a text-cache, and returns it to the main traverse routine. Although it might be nice to simply assume that the text will be in one nice chunk, headed by a '0', the chances are it will have <latin> child nodes embedded within it, so we need two while loops to pull the data out of the $node until we run out of text or Latin items.

Apart from this, the routine also strips leading and trailing whitespace, and also gets rid of internal tabs and newlines that were only present in the XML as formatting prettification. Note that this will reduce those pesky empty '0', '\n' nodes that we glossed over in the parse tree to empty strings.

sub get_text
{    
    # This subroutine is called whenever we need to parse text, which may have
    # embedded <latin> sections. This cannot be done by traverse, since 
    # this will en-list the Latin names. If we had a DTD, this would be made
    # explicit by only allowing <latin> within text data (and disallowing 
    # nesting). The subroutine can cope with both vanilla text, and
    # [ '0', 'description', '0', 'more text', 'kingdom', [ ...divisions...] ]
    # nodes.
    my $node = shift;
    my $text = ''; # Cached text.
    if ( ref $node )
    {        
        # Note that because $node is a reference, when we modify it with splice,
        # we are modifying the data structure that traverse works on. This means
        # we can splice out text nodes (as with the description), but leave any 
        # element child nodes (like a <division> within an <kingdom>) for 
        # traverse to deal with.    
        while ( defined $node->[0] && $node->[0] eq '0' )
        {
            # Text is found with '0' as its 'element' name. 
            my ( $marker, $data ) = splice @{ $node }, 0, 2;
            $text .= $data;
            while ( defined $node->[0] && $node->[0] eq "latin" )
            {
                # Deal with embedded <latin> nodes.
                my ( $element, $latin ) = splice @{ $node }, 0, 2;
                $text .= "<em>$latin->[2]</em>";
            }
        }
    }
    else
    {
        $text = $node;
    }
    # Strip unwanted whitespace.
    $text =~ s/^\s+//sg; # Leading.
    $text =~ s/\s+$//sg; # Trailing.
    $text =~ s/\s*\n\t+/ /sg;
        # Internal linebreaks. You could do something clever with text wrapping
        # and $depth if you wanted, but we'll take the 'simple' approach. 
    return $text;
}

The parser is available here in its entirety as a text file, and the output is viewable here: Archaea, Bacteria and Eukarya. There are dozens of other modules you can use: just try searching CPAN for XML manipulation, and regular columns on www.xml.com and www.perl.com dealing with perl/XML interaction, so you should be able to find something out there that fits your needs. Hope this brief tutorial helped.

Simple XML

The XML::Parser module may be a little heavyweight for your liking, especially if your liking is simply to use an XML file as a persistent pile of configurations (an simple, extensible and sensible alternative to the likewise sensible use of database-tied hashes). For such occasional, may I recommend XML::Simple. This module is specifically designed to slurp up XML files and convert them into perl data structures. For example, if you had an FTP script that needed access to a list of servers, logins and passwords, you could create a neat little XML file like this:

<?xml version="1.0"?>
<creds>
    <name>Site1<name>
    <server>ftp://ftp.bar.com<server>
    <login>mrfibble</login>
    <password>n0dul3s</password>
</creds>

After saving this as creds.xml, it is the simple matter of using the following to get at the credentials:

use XML::Simple;
my $details= XMLin( "creds.xml" );
print "Logging into $details->{name}...\n";
ftp
( 
    server   => $details->{server}, 
    login    => $details->{login}, 
    password => $details->{password}, 
    files    => \@ARGV,
);
sub ftp { ... }

There's lots more to this module (if you're in any doubt as to the data structure that you get from your XML file, just use Data::Dumper to serialise the structure), but for now let's get onto parsing less well behaved things than XML…

Next…