Lesson 19

Design and technology

If parsing was a little too mathematically esoteric for you, how about something a little more down to Earth? Like publishing webpages. Writing CGI scripts is probably what Perl is used for most, and like most things, there is an Easy Way, and there is a Hard Way. As usual, the Hard Way looks deceptively simple at first, but eventually blows up in your face (this is the voice of experience talking, so take heed of my mistakes!). The Easy Way seems a little more complex at first, but it'll save you weeks of effort when it comes to changing things later.

As far as the Easy Way for CGI scripts goes, this is likely to mean using two modules, one to deal with your design (i.e. the HTML), and the other to deal with your technology (i.e. the Perl). We'll be looking at two stalwart modules for these purposes in this lesson: HTML::Template for the design, and CGI for the technology.

There are very good reasons to extricate your design and your technology, not least of which is that it makes it easier to radically change one (and you will, even if you think you won't) with minimal impact on the other. The other reason is simple tidiness: it is much easier to edit scripts that aren't spattered with HTML markup, and it's much easier to change HTML if it's not interspersed with neatly indented code - PHP and ASP aficionados may feel free to disagree ☺

Just to give you a brief taster of the techniques available to those fools who would follow the Hard Way (such as my earlier self): parsing CGI script GET and POST data is not quite as simple as it might seem. At the very least, you will need something like this:

sub parse
{
    my %env = @_;
    my ( @pairs, %formdata );
    if ( $env{ REQUEST_METHOD } eq 'GET' )
    {
        @pairs = split /&/, $env{ QUERY_STRING };
    }
    elsif ( $env{ REQUEST_METHOD } eq 'POST' )
    {
        read ( STDIN, $_, $env{ CONTENT_LENGTH } );
        @pairs = split /&/, $_;
        if ( $env{ QUERY_STRING } )
        {
            my   @getpairs = split /&/, $env{ QUERY_STRING };
            push @pairs, @getpairs;
        }
    }
    foreach ( @pairs )
    {
        my ( $key, $value ) = split /=/;
        $key =~ tr/+/ /;
        $key =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack( "C", hex($1) )/eg;
            # remove hex encoding
        $value =~ tr/+/ /;
        $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack( "C", hex($1) )/eg;
            # remove hex encoding
        $value =~ s/<!--.*-->//sg;
            # remove server side includes (badly)
        if ( $formdata{ $key } )
        {
            $formdata{ $key } .= ", $value";
        }
        else
        {
            $formdata{ $key } = $value;
        }
    }
    return %formdata;
}

And that's just to get the data out of the sundry envelopes it can arrive in, ungibberished to get back all those characters query strings don't like, and into a usable hash. Ugh.
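To make the "ungibberishing" concrete, here is a minimal sketch of the decoding step on its own, using only core Perl. The names decode_component and parse_query and the sample query string are made up for illustration. It is not a replacement for a proper module, just a demonstration of two places where a hand-rolled parser can be fussy: values that contain an = sign, and repeated keys whose first value is 0.

```perl
use strict;
use warnings;

# Decode one URL-encoded key or value: '+' means space, %XX is a hex byte.
sub decode_component {
    my ($text) = @_;
    $text =~ tr/+/ /;
    $text =~ s/%([0-9A-Fa-f]{2})/chr( hex $1 )/eg;
    return $text;
}

# Hypothetical helper: turn a raw query string into a hash, joining
# repeated keys with ", " in the same way as the parse subroutine above.
sub parse_query {
    my ($query) = @_;
    my %formdata;
    for my $pair ( split /&/, $query ) {
        # Limit the split to 2 fields so a value may itself contain '='.
        my ( $key, $value ) = map { decode_component($_) } split /=/, $pair, 2;
        next unless defined $key;            # skip empty pairs such as '&&'
        $value = '' unless defined $value;   # 'flag' with no '=' at all
        # Test with exists() rather than truth, so an earlier "0" is kept.
        $formdata{$key} = exists $formdata{$key}
            ? "$formdata{$key}, $value"
            : $value;
    }
    return %formdata;
}

my %decoded = parse_query('name=Steve+O%27Neil&sum=1%2B1=2&n=0&n=1');
print "$_ => $decoded{$_}\n" for sort keys %decoded;
```

Here name decodes to "Steve O'Neil", sum keeps its embedded = sign, and n collects both of its values, including the leading 0.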

Churning out HTML to clients is also deceptively simple the Hard Way: we just need to give the CGI script a heredoc, like:

#!/usr/bin/perl
use strict;
my %formdata = parse ( %ENV );
# use "" around your heredoc marker and it'll interpolate
# $foo and $bar just like a "" quoted string
print << "THIS";
Content-type: text/html

<html>
<head>
<title>Unwise words</title>
</head>
<body>
<p>Looks deceptively simple, doesn't it $formdata{name}?</p>
<p>Just whatever you do, don't forget the Content-type header line </p>
<p>(and the blank line following) or it won't work.</p>
<p>Oh, and don't forget to put a suitable shebang in,</p>
<p>And don't forget to chmod 0755 the script after you've FTP-ed it</p>
<p>(in ASCII mode of course).</p>
</body>
</html>
THIS

You have now been provided with enough rope to hang yourself with. With the parse subroutine and the heredoc I have shown here, you can write CGI scripts that, for the most part, work (for some definition of 'work'), dynamically providing content to passing browsers. So, if you like things that come and bite you on the arse six months down the line, then read no further. However, if you would rather not be tearing your hair out in six months over how precious your hand-rolled parser is about its input, and how unmaintainable your HTML/code chimaera is, then read on.

Grubby little hands

Here's the Easy Way to start writing a CGI script. This is very similar to the template I use for my own CGI scripts:

#!/usr/bin/perl -T
use strict;
use warnings;
#use CGI::Carp qw( fatalsToBrowser ); 
    # we can uncomment this if we want debugging information in the browser
use CGI qw();
my $cgi = CGI->new();
use HTML::Template;
my $template = HTML::Template->new( filename => "template.html" );
# ...time passes...
print $cgi->header();
print $template->output();
exit( 0 );
__END__
=head1 NAME script.cgi
=head1 SYNOPSIS
perl -T script.cgi
=head1 DESCRIPTION
Does what it says on the tin.

We'll ignore the POD (you know all about POD already), the use strict;, and the use warnings; (the modern way of saying -w). I have mentioned use CGI::Carp qw( fatalsToBrowser) before as a neat way of getting debugging information sent to the browser rather than to some Apache error log somewhere or other. However, the new things here are the taint checking (-T), which we only touched on briefly before, and the CGI and HTML::Template modules.

First the taint checking. Taintedness is the property of scalar data that a user has had his or her grubby hands on. Since users cannot be trusted to not break your scripts (or indeed, your system) through incompetence or malice, perl puts a little sticky label on data entered by users from filehandles or set by users in their environment saying, "Tainted! Do not use this data in anything dangerous!". This means that:

#!/usr/bin/perl -T
use strict;
chomp( my $tainted = <STDIN> );
system $tainted;

will barf (for several reasons) and prevent your script from executing input from STDIN like "rm -rf *". In fact, if you execute this script from the command line:

perl script.pl

the script will actually barf with:

Too late for -T option

Although #!/usr/bin/perl -T will work fine in CGI scripts, it will not work if you call the script from the command line, unless you call it like this:

perl -T script.pl

The reason for this is that by the time perl has compiled your code, it has already had to deal with tainted environment variables, hence the "too late" whinge. Just remember: the -T shebang will work fine in your CGI scripts, but if you are calling a taint-checking script from the command line, you'll need to use the -T switch on the command line instead/as well.

perldoc perlsec will give you an exhaustive run-down of everything to do with taint checking, but the following are rough rules of thumb:

Tainted data propagates to any other variable you use it in, so the following will still (thankfully) barf:

#!/usr/bin/perl -T
use strict;
$ENV{PATH} = "C:/cygwin/bin/";
chomp( my $tainted = <STDIN> );
my $still_tainted = "$tainted" . "; echo All gone";
system $still_tainted;

Note that we have set $ENV{PATH}: since %ENV is considered tainted, if we miss out this line, the script won't even get to the dodgy system call before it chucks a fit.

Taint-checking won't catch every possible insecure thing your user can do, but it does go a long way to avoiding common pitfalls. Cleaning tainted data is quite simple: you just need to reset it explicitly, thereby overwriting the tainted data, before you can use it in a potentially dangerous construct. There are two simple ways of doing this: either set the tainted variable yourself, ignoring the user's preferences (like we did with $ENV{PATH}), or use a pattern match to capture a valid, untainted value:

#!/usr/bin/perl -T
use strict;
$ENV{PATH} = "C:/cygwin/bin/";
chomp( my $tainted = <STDIN> );
my ( $laundered ) = $tainted =~ /^(echo)$/i or die "Unexpected input\n";
$laundered .= " All OK now";
system $laundered and die $?, $!;

That's all there really is to taint checking: it's just a simple way of hugely increasing the security of your CGI scripts by the simple measure of positively vetting user-inputted data for validity and un-dodginess whenever it is used in potentially dangerous situations.
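As a slightly fuller sketch of the laundering idea, the following wraps the pattern match in a subroutine that returns undef when the input does not look valid, so the caller cannot carry on with junk by accident. The subroutine name launder_command and the whitelist of commands are hypothetical. The code runs with or without -T; under -T, the value captured by the pattern match is what perl treats as untainted.

```perl
use strict;
use warnings;

# Hypothetical whitelist of commands we are prepared to run.
my %ALLOWED_COMMAND = map { $_ => 1 } qw( echo date );

# Return a clean command name captured from the raw input,
# or undef if the input is not exactly one allowed word.
sub launder_command {
    my ($raw) = @_;
    return undef unless defined $raw;
    # \A and \z anchor the whole string, so a trailing newline is rejected.
    my ($word) = $raw =~ /\A([a-z]+)\z/;
    return undef unless defined $word && $ALLOWED_COMMAND{$word};
    return $word;
}

for my $input ( 'echo', 'rm -rf *', "echo\n" ) {
    my $clean = launder_command($input);
    print defined $clean ? "accepted: $clean\n" : "rejected\n";
}
```

The point of the design is that validity is decided by what you explicitly allow, not by trying to list everything dodgy a user might type.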

Form parsing

The CGI module is a monster. Not only does it allow you to use it for parsing form data and cookies, it's also an HTML editor of sorts. We will not be using this latter functionality at all in this lesson, as we have a neater solution for this (HTML::Template). We'll be using CGI purely as a way of getting GET and POST data and cookies out of CGI forms. To cut down on the amount of junk we parse and import, we use one of these lines to include the CGI module:

use CGI qw( :cgi ); # if we plan on using CGI imperatively (see below),
# this just imports the form processing functions and
# none of the HTML generating stuff

or:

use CGI qw(); # if we plan on using CGI in an object-oriented style,
# there's no need to import anything, so we say this 
# explicitly with an empty import list, use MODULE qw();

I assume you know how to create HTML forms, but as a quick summary, forms are ways of sending data from a webpage to a server to generate dynamic content. Below you will find the various elements rigged up to the script we'll be playing with presently, echo.cgi, which simply spits back form parameters you pass it.

OK, that was rather more an exhaustive list than a summary, but never mind. When this form data is sent to the server, it can arrive in various ways, which the parse subroutine I outlined above will do a reasonable job of decoding. However, it is far easier to use the CGI module to do this for you. There are two main approaches to this. Whichever you use, you will need to have this at the top of your script somewhere:

#!/usr/bin/perl -T
use strict;
use CGI qw( :cgi );
my $cgi = CGI->new();

CGI is actually capable of being used as an object-oriented module, or as a typical, imperative module. We'll use the OO syntax here, as I prefer it. Feel free to do the other thing. The two main approaches to getting data out of the $cgi object are Vars() and param(). The former returns the form data as a hash of name => value pairs, the latter returns a single value given a name. Do whichever seems more appropriate, either this:

my %formdata = $cgi->Vars();
print $formdata{ scrawl };

or this:

my $scrawl_value = $cgi->param( 'scrawl' );
print $scrawl_value;
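
If you want to try these two methods without a web server, one option is to hand CGI->new() a query string directly. The sketch below does that with a made-up query string (the colour field is hypothetical). It also shows what Vars() does with a field that arrives more than once: the values are packed into one string separated by "\0" characters, which you can split apart again.

```perl
use strict;
use warnings;
use CGI qw();

# Passing a query string to new() is only a convenience for trying
# things out; in a real CGI script you would call CGI->new() with no
# arguments and let the module read the request itself.
my $cgi = CGI->new('scrawl=hello+world&colour=red&colour=blue');

# param() in scalar context: one value for one name.
my $scrawl_value = $cgi->param('scrawl');

# Vars() in list context: the whole form as a hash.
my %formdata = $cgi->Vars();

# Multi-valued fields come back from Vars() joined with "\0".
my @colours = split /\0/, $formdata{colour};

print "scrawl  = $scrawl_value\n";
print "colours = @colours\n";
```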

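Another useful method is the header() method. This returns the "Content-Type: text/html" header line, plus the blank line that must follow it, as required for all CGI scripts generating HTML output on Apache webservers (that's most perl CGIs). It does not print anything itself, so you need to print what it returns: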

print $cgi->header();

In fact, these are the only three methods I ever really use from CGI! However, it is nice to know that these methods have been tried and tested by a bazillion perl scripters, and the chances of them having any shallow bugs left are very small, which I cannot say for the dodgy parse() subroutine I outlined earlier.

Sometimes you will want to set cookies, or read them, so as to save the user's configurations (such as whether they would like their pages in English or Esperanto) between sessions. CGI will do this too. To create cookies, all you need to do is use:

my $generic_cookie = $cgi->cookie( -name => "cookie_name", -value => "cookie_value" );
my $language_cookie = $cgi->cookie( -name => "language", -value => "Esperanto" );

These return cookie objects that can be used when you print out the HTTP header:

print $cgi->header( -cookie => $generic_cookie ); # set one cookie
print $cgi->header( -cookie => [ $generic_cookie, $language_cookie ] );
    # or set several cookies

They are easily retrieved using:

my $cookie = $cgi->cookie( 'cookie_name' );
my $preferred_language = $cgi->cookie( 'language' );
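
Putting the reading and the setting together, here is a small round-trip sketch. Two things in it are assumptions for the sake of the example: the HTTP_COOKIE environment variable is set by hand to imitate a returning browser, and the cookie is given an -expires value of 30 days. Without -expires, the cookie only lasts until the user closes the browser, which is not much use for remembering preferences between sessions.

```perl
use strict;
use warnings;
use CGI qw();

# Imitate a returning browser. On a real server the web server sets
# HTTP_COOKIE for you; setting it by hand is purely for illustration.
$ENV{HTTP_COOKIE} = 'language=Esperanto';

# The empty query string stops CGI looking elsewhere for form input
# when this is run from the command line.
my $cgi = CGI->new('');

# Read the cookie the browser sent back.
my $preferred_language = $cgi->cookie('language');

# Build a cookie that lasts 30 days rather than one browser session.
my $language_cookie = $cgi->cookie(
    -name    => 'language',
    -value   => $preferred_language,
    -expires => '+30d',
);

# header() returns the header text, Set-Cookie line included.
my $header = $cgi->header( -cookie => $language_cookie );
print $header;
```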

That lot should be plenty to get you started. We'll now look at the echo script that the forms above sent their data to:

#!/usr/bin/perl -T
#echo v1.0
use strict;
use warnings;
#use CGI::Carp qw( fatalsToBrowser );
use CGI qw( :cgi );
use HTML::Template;
my $cgi = CGI->new();
my $template = HTML::Template->new( filename => "templates/echo.html" );
$template->param
(
    TITLE => "Echo",
    KEYWORDS => "echo, Steve's place",
    DESCRIPTION => "Echoes back form data parameters",
);
my %formdata = $cgi->Vars();
$template->param( SCRIPT_NAME => $ENV{ SCRIPT_NAME } );
$template->param( REQUEST_METHOD => $ENV{ REQUEST_METHOD } );
$template->param( HTTP_REFERER => $ENV{ HTTP_REFERER } );
$template->param( HTTP_USER_AGENT => $ENV{ HTTP_USER_AGENT } );
my @formdata;
while ( my ( $k, $v ) = each %formdata )
{
    push @formdata, { NAME => $k, VALUE => $v };
}
$template->param( FORMDATA => \@formdata );
print $cgi->header();
print $template->output();
exit( 0 );
__END__

The CGI-related parts should be quite clear: the script runs under warnings, strict and taint, as a secure and well written script should do. It uses CGI, and gets the form data out into a hash using the Vars() method. It then prints out the HTTP header. The rest of the script is there to generate the HTML itself, using HTML::Template, which is our next port of call.

Templates

HTML::Template is a simple and clean way of generating HTML dynamically. There are more complex ways (like Mason), and simpler ways (like heredocs), but HTML::Template seems to tread a nice path between these extremes, and neatly disconnects the majority of the code from the design. The module allows three main constructs in the HTML template: variables, loops and conditionals, which is about as complex as you can embed into HTML without severely entangling the design with the technology. Here is the template for the echo script:

<html>
  <head>
    <title>Steve's place- <TMPL_VAR NAME="TITLE"></title>
    <meta name="keywords" content="<TMPL_VAR NAME="KEYWORDS">">
    <meta name="description" content="<TMPL_VAR NAME="DESCRIPTION">">
    <link rel="stylesheet" type="text/css" href="../style.css">
  </head>
  <body>
    <!--blah, some junk omitted here-->
    <h1>Your script parameters were...</h1>
    <p>Action (SCRIPT_NAME) = <b><TMPL_VAR NAME="SCRIPT_NAME"></b></p>
    <p>Method (REQUEST_METHOD) = <b><TMPL_VAR NAME="REQUEST_METHOD"></b></p>
    <TMPL_IF NAME="HTTP_REFERER">
    <p>Referrer (HTTP_REFERER) = <b><TMPL_VAR NAME="HTTP_REFERER"></b></p>
    <TMPL_ELSE>
    <p>Referrer (HTTP_REFERER) = <b>Direct request</b></p>
    </TMPL_IF>
    <p>Browser (HTTP_USER_AGENT) = <b><TMPL_VAR NAME="HTTP_USER_AGENT"></b></p>
    <p>Form data</p>
    <ul>
      <TMPL_LOOP NAME="FORMDATA">
      <li><TMPL_VAR NAME="NAME"> = <b><TMPL_VAR NAME="VALUE"></b></li>
      </TMPL_LOOP>
    </ul>
    <!--blah, some junk omitted here-->
  </body>
</html>

The 'gaps' that HTML::Template will fill in are the TMPL_VAR, TMPL_IF and TMPL_LOOP tags. HTML::Template has three important methods. The first is new():

my $template = HTML::Template->new( filename => "templates/echo.html" );

This creates a templating object which will fill in the gaps in a file called templates/echo.html, which is the very thing shown above. The second important method is param(), which takes a hash of name => value pairs:

$template->param( template_variable_name => "value to substitute in" );
$template->param( SCRIPT_NAME => $ENV{ SCRIPT_NAME } );

When the template is printed out, any occurrence of the tag:

<TMPL_VAR NAME="SCRIPT_NAME">

in the template will be replaced with the value of $ENV{ SCRIPT_NAME } (e.g. "cgi-bin/echo.cgi"). If you compare the echo script and the echo template, you will see the script sets several TMPL_VARs in the same way, such as TITLE and HTTP_REFERER, and in the template, you will find the tags <TMPL_VAR NAME="TITLE"> and <TMPL_VAR NAME="HTTP_REFERER">.

It really is that simple! For simple incorporation of scalar variables into the output, all you need to do is set:

$template->param( BLAH => "FIBBLE" );

in your CGI script, and incorporate the corresponding named TMPL_VAR tag:

<p><TMPL_VAR NAME="BLAH"></p>

or similar into your HTML template. When you come to use the third method of HTML::Template, output(), the template object will generate this:

Script:
print $template->output();
Output:
<p>FIBBLE</p>

This accounts for about half of the variables in the echo script. However, the module, as I said, also allows for conditionals and loops. To create loops, rather than passing a simple name => value pair:

Script:
$template->param( SCRIPT_NAME => $ENV{ SCRIPT_NAME } );
Template:
<TMPL_VAR NAME="SCRIPT_NAME">

you use a reference to an array of hashrefs instead:

Script:
my @formdata;
while ( my ( $k, $v ) = each %formdata )
{
    push @formdata, { NAME => $k, VALUE => $v }; 
      # create an array of hashrefs
}
$template->param( FORMDATA => \@formdata ); 
  # give param a reference to this array of hashrefs
Template:
<TMPL_LOOP NAME="FORMDATA">
    <li><TMPL_VAR NAME="NAME"> = <b><TMPL_VAR NAME="VALUE"></b></li>
</TMPL_LOOP>

to generate something like this:

<li>language = <b>Esperanto</b></li>
<li>encoding = <b>UTF8</b></li>
<li>...

If you pass the param() method a ( FOO => \@array_of_hashrefs ) pair, the module will look for a corresponding <TMPL_LOOP NAME="FOO"></TMPL_LOOP> pair in the template. So in this case, we define an arrayref called FORMDATA, which contains a number of { NAME => "language", VALUE => "Esperanto" } hashrefs in the script. When we send this data to the template, it sets <TMPL_VAR NAME="NAME"> and <TMPL_VAR NAME="VALUE"> to each of the corresponding values from the loop variable. This actually makes it sound more complicated than it really is: if you just read the code, it makes intuitive sense.
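
If you would like to run the loop on its own, here is a minimal sketch. To keep it self-contained, the template is held in a string and passed to new() with the scalarref option instead of a filename, and the form data is a made-up hash. The keys are sorted before building the rows, since each hands hash entries back in no particular order.

```perl
use strict;
use warnings;
use HTML::Template;

# A cut-down version of the echo template, kept in a string for the demo.
my $template_text = <<'TEMPLATE';
<ul>
<TMPL_LOOP NAME="FORMDATA"><li><TMPL_VAR NAME="NAME"> = <b><TMPL_VAR NAME="VALUE"></b></li>
</TMPL_LOOP></ul>
TEMPLATE

# Made-up form data standing in for $cgi->Vars().
my %formdata = ( language => 'Esperanto', encoding => 'UTF8' );

# One hashref per row; the keys match the TMPL_VAR names inside the loop.
my @rows = map { +{ NAME => $_, VALUE => $formdata{$_} } } sort keys %formdata;

my $template = HTML::Template->new( scalarref => \$template_text );
$template->param( FORMDATA => \@rows );

my $html = $template->output();
print $html;
```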

To create conditionals is just as easy:

Script:
$template->param( HTTP_REFERER => $ENV{ HTTP_REFERER } );
Template:
<TMPL_IF NAME="HTTP_REFERER">
    <p>Referrer (HTTP_REFERER) = <b><TMPL_VAR NAME="HTTP_REFERER"></b></p>
<TMPL_ELSE>
    <p>Referrer (HTTP_REFERER) = <b>Direct request</b></p>
</TMPL_IF>

We set a parameter in the template object called HTTP_REFERER in the script. In the template, if this is TRUE, then the HTML between the <TMPL_IF NAME="HTTP_REFERER"></TMPL_IF> will be filled in appropriately and outputted. You can also (as we have done here), specify a <TMPL_ELSE> within this structure to be filled in and outputted if HTTP_REFERER is FALSE. Simple.
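
The same trick of keeping the template in a string makes it easy to see both branches of the conditional. In this sketch the subroutine name render_referrer and the example URL are made up, and an empty string stands in for a missing HTTP_REFERER.

```perl
use strict;
use warnings;
use HTML::Template;

# The conditional from the echo template, kept in a string for the demo.
my $template_text = <<'TEMPLATE';
<TMPL_IF NAME="HTTP_REFERER"><p>Referrer = <b><TMPL_VAR NAME="HTTP_REFERER"></b></p>
<TMPL_ELSE><p>Referrer = <b>Direct request</b></p>
</TMPL_IF>
TEMPLATE

# Hypothetical helper: render the template for a given referrer value.
sub render_referrer {
    my ($referer) = @_;
    my $template = HTML::Template->new( scalarref => \$template_text );
    $template->param( HTTP_REFERER => $referer );
    return $template->output();
}

print render_referrer('http://example.com/form.html');   # TMPL_IF branch
print render_referrer('');                                # TMPL_ELSE branch
```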

And that's all there is to it. My search script, guestbook, Madame Perlmina, consensus script, error documents and image embedder all use these two basic modules, and it has been a huge and wonderful relief how much tidier and more maintainable this has made them. So learn from my mistakes, and do it the Easy Way from the start!
