Design and technology
If parsing was a little too mathematically esoteric for you, how about something a little more down to Earth? Like publishing webpages. Writing CGI scripts is probably what Perl is used for most, and like most things, there is an Easy Way, and there is a Hard Way. As usual, the Hard Way looks deceptively simple at first, but eventually blows up in your face (this is the voice of experience talking, so take heed of my mistakes!). The Easy Way seems a little more complex at first, but it'll save you weeks of effort when it comes to changing things later.
As far as the Easy Way for CGI scripts goes, this is likely to mean using two modules, one to deal with your design (i.e. the HTML), and the other to deal with your technology (i.e. the Perl). We'll be looking at two stalwart modules for these purposes in this lesson:
- HTML::Template for your design
- CGI for the underpinning of your technology
There are very good reasons to separate your design from your technology, not least of which is that it makes it easier to radically change one (and you will, even if you think you won't) with minimal impact on the other. The other reason is simple tidiness: it is much easier to edit scripts that aren't spattered with HTML markup, and it's much easier to change HTML if it's not interspersed with neatly indented code - PHP and ASP aficionados may feel free to disagree ☺
Just to give you a brief taster of the techniques available to those
fools who would follow the Hard Way (such as my earlier self): parsing CGI
script GET and POST data is not quite as simple
as it might seem. At the very least, you will need something like
this:
sub parse
{
    my %env = @_;
    my ( @pairs, %formdata );
    if ( $env{ REQUEST_METHOD } eq 'GET' )
    {
        @pairs = split /&/, $env{ QUERY_STRING };
    }
    elsif ( $env{ REQUEST_METHOD } eq 'POST' )
    {
        read ( STDIN, $_, $env{ CONTENT_LENGTH } );
        @pairs = split /&/, $_;
        if ( $env{ QUERY_STRING } )
        {
            my @getpairs = split /&/, $env{ QUERY_STRING };
            push @pairs, @getpairs;
        }
    }
    foreach ( @pairs )
    {
        my ( $key, $value ) = split /=/;
        $key =~ tr/+/ /;
        $key =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack( "C", hex($1) )/eg;
        # remove hex encoding
        $value =~ tr/+/ /;
        $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack( "C", hex($1) )/eg;
        # remove hex encoding
        $value =~ s/<!--.*-->//sg;
        # remove server side includes (badly)
        if ( $formdata{ $key } )
        {
            $formdata{ $key } .= ", $value";
        }
        else
        {
            $formdata{ $key } = $value;
        }
    }
    return %formdata;
}
And that's just to get the data out of the sundry envelopes it can arrive in, ungibberished to get back all those characters query strings don't like, and into a usable hash. Ugh.
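To make the fragility a little more concrete, here is a minimal sketch of two ways the hand-rolled approach goes wrong. The naive_pairs() subroutine is a hypothetical, cut-down copy of the foreach loop in parse() (it takes the query string as an argument and leaves out the SSI stripping), and the query string is made up for the purpose:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# naive_pairs() is a hypothetical, cut-down copy of the foreach loop in
# parse() above: it takes the raw query string as its only argument.
sub naive_pairs
{
    my ( $query_string ) = @_;
    my %formdata;
    foreach my $pair ( split /&/, $query_string )
    {
        my ( $key, $value ) = split /=/, $pair;   # no limit, as in parse()
        foreach ( $key, $value )
        {
            tr/+/ /;
            s/%([a-fA-F0-9][a-fA-F0-9])/pack( "C", hex($1) )/eg;
        }
        if ( $formdata{ $key } )                  # truth test, as in parse()
        {
            $formdata{ $key } .= ", $value";
        }
        else
        {
            $formdata{ $key } = $value;
        }
    }
    return %formdata;
}

my %result = naive_pairs( 'formula=a=b&count=0&count=5' );
print "formula: $result{formula}\n";    # the "=b" part has been thrown away
print "count:   $result{count}\n";      # the first value, "0", has been lost
```

The split without a limit chops a value at its first literal = sign, and the truth test treats a legitimate value of 0 as "nothing here yet". Both are easy to fix once you have spotted them, but spotting them is the hard part.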
Churning out HTML to clients is also deceptively simple the Hard Way: we just need to give the CGI script a heredoc, like:
#!/usr/bin/perl
use strict;
# parse() is the subroutine shown above, pasted into this same script
my %formdata = parse ( %ENV );
# use "" around your heredoc marker and it'll interpolate
# $foo and $bar just like a "" quoted string
print <<"THIS";
Content-type: text/html

<html>
<head>
<title>Unwise words</title>
</head>
<body>
<p>Looks deceptively simple, doesn't it $formdata{name}?</p>
<p>Just whatever you do, don't forget the Content-type header line </p>
<p>(and the blank line following) or it won't work.</p>
<p>Oh, and don't forget to put a suitable shebang in,</p>
<p>And don't forget to chmod 0755 the script after you've FTP-ed it</p>
<p>(in ASCII mode of course).</p>
</body>
</html>
THIS
You have now been provided with enough rope to hang yourself with.
With the parse subroutine and the heredoc I have shown here,
you can write CGI scripts that, for the most part, work (for some
definition of 'work'), dynamically providing content to passing browsers.
So, if you like things that come and bite you on the arse six months down
the line, then read no further. However, if you would rather not be
tearing your hair out in six months over how precious your hand-rolled
parser is about its input, and how unmaintainable your HTML/code chimaera
is, then read on.
Grubby little hands
Here's the Easy Way to start writing a CGI script. This is very similar to the template I use for my own CGI scripts:
#!/usr/bin/perl -T
use strict;
use warnings;
#use CGI::Carp qw( fatalsToBrowser );
# we can uncomment this if we want debugging information in the browser
use CGI qw();
my $cgi = CGI->new();
use HTML::Template;
my $template = HTML::Template->new( filename => "template.html" );
# ...time passes...
print $cgi->header();
print $template->output();
exit( 0 );
__END__

=head1 NAME

script.cgi

=head1 SYNOPSIS

perl -T script.cgi

=head1 DESCRIPTION

Does what it says on the tin.
We'll ignore the POD (you know all about
POD already), the use strict;, and the use
warnings; (the modern way of saying -w). I have mentioned use CGI::Carp qw(
fatalsToBrowser) before as a neat way of getting debugging
information sent to the browser rather than to some Apache error log
somewhere or other. However, the new things here are the taint checking (-T), which
we only touched on briefly before, and the CGI and
HTML::Template modules.
First the taint checking. Taintedness is the property of scalar data that a user has had his or her grubby hands on. Since users cannot be trusted to not break your scripts (or indeed, your system) through incompetence or malice, perl puts a little sticky label on data entered by users from filehandles or set by users in their environment saying, "Tainted! Do not use this data in anything dangerous!". This means that:
#!/usr/bin/perl -T
use strict;
chomp( my $tainted = <STDIN> );
system $tainted;
will barf (for several reasons) and prevent your script from executing
input from STDIN like "rm -rf *". In fact, if
you execute this script from the command line:
perl script.pl
the script will actually barf with something like this (the exact wording varies between perl versions):
Too late for -T option
Although #!/usr/bin/perl -T will work fine in CGI
scripts, it will not work if you call the script from the
command line, unless you call it like this:
perl -T script.pl
The reason for this is that by the time perl finds the -T on the
shebang line, it has already started up without taint checking, and it is
too late to properly taint everything that came in from the environment,
hence the "too late" whinge. Just remember: the -T shebang will work
fine in your CGI scripts, but if you are calling a taint-checking script
from the command line, you'll need to use the -T switch on
the command line instead/as well.
perldoc perlsec will give you an exhaustive run-down of
everything to do with taint checking, but the following are rough rules
of thumb:
- Commands that are considered unsafe:
  - the single argument forms of system and exec,
  - `backticks`,
  - any command that spawns a sub-shell,
  - all functions that modify files, like open ">$file",
  - certain directory functions, like chdir.
- Data from the user that are considered tainted:
  - @ARGV,
  - %ENV (including $ENV{PATH}, so you may well need to set this explicitly in taint-checking scripts),
  - the "." entry in @INC,
  - output from readdir and glob,
  - any data taken from filehandles, like STDIN.
Tainted data propagates to any other variable you use it in, so the following will still (thankfully) barf:
#!/usr/bin/perl -T
use strict;
$ENV{PATH} = "C:/cygwin/bin/";
chomp( my $tainted = <STDIN> );
my $still_tainted = "$tainted" . "; echo All gone";
system $still_tainted;
Note that we have set $ENV{PATH}: since %ENV
is considered tainted, if we miss out this line, the script won't even
get to the dodgy system call before it chucks a fit.
Taint-checking won't catch every possible insecure thing your user can
do, but it does go a long way to avoiding common pitfalls. Cleaning
tainted data is quite simple: you just need to reset it explicitly,
thereby overwriting the tainted data, before you can use it in a
potentially dangerous construct. There are two simple ways of doing this:
either set the tainted variable yourself, ignoring the user's preferences
(like we did with $ENV{PATH}), or use a pattern match to
capture a valid, untainted value:
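#!/usr/bin/perl -T
use strict;
$ENV{PATH} = "C:/cygwin/bin/";
chomp( my $tainted = <STDIN> );
my ( $laundered ) = $tainted =~ /^(echo)$/i;
# no match leaves $laundered undefined, so refuse to carry on
die "Invalid input\n" unless defined $laundered;
$laundered .= " All OK now";
system $laundered and die $?, $!;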
That's all there really is to taint checking: it's just a simple way of hugely increasing the security of your CGI scripts by the simple measure of positively vetting user-inputted data for validity and un-dodginess whenever it is used in potentially dangerous situations.
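As a slightly more realistic sketch of the pattern-match approach, here is a hypothetical helper, launder_filename(), that only lets through things that look like plain file names. The name, the whitelist and the test values are all my own assumptions rather than anything perl provides; the script runs with or without -T, but it is only under -T that the captured value being untainted actually matters:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# launder_filename() is a hypothetical helper: it returns a copy of $name
# captured by a regex (and therefore untainted under -T) if $name looks
# like a plain file name, and undef otherwise.
sub launder_filename
{
    my ( $name ) = @_;
    return unless defined $name;
    # Whitelist: letters, digits, underscores, dots and hyphens only, and
    # no leading dot, so slashes, spaces and shell metacharacters all fail.
    my ( $clean ) = $name =~ /\A([A-Za-z0-9_][A-Za-z0-9_.-]*)\z/;
    return $clean;
}

foreach my $candidate ( 'guestbook.txt', '../../etc/passwd', 'x; rm -rf *' )
{
    my $clean = launder_filename( $candidate );
    print defined $clean ? "OK:       $clean\n" : "Rejected: $candidate\n";
}
```

The important design choice is to describe what is allowed rather than trying to list everything that is dodgy: a pattern like /(.*)/ would also untaint the data, but it vets nothing at all.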
Form parsing
The CGI module is a monster. Not only does it allow you
to use it for parsing form data and cookies, it's also an HTML editor of
sorts. We will not be using this latter functionality at all in this
lesson, as we have a neater solution for this
(HTML::Template). We'll be using CGI purely as
a way of getting GET and POST data and cookies
out of CGI forms. To cut down on the amount of junk we parse and import,
we use one of these lines to include the CGI module:
use CGI qw( :cgi );
# if we plan on using CGI imperatively (see below),
# this just imports the form processing functions and
# none of the HTML generating stuff
or:
use CGI qw();
# if we plan on using CGI in an object-oriented style,
# there's no need to import anything, so we say this
# explicitly with an empty import list, use MODULE qw();
I assume you know how to create HTML forms, but as a quick summary, forms are ways of sending data from a webpage to a server to generate dynamic content. Below you will find the various elements rigged up to the script we'll be playing with presently, echo.cgi, which simply spits back form parameters you pass it.
- Forms are delimited by:
  <form action="script.cgi" method="POST"></form>
  The action, script.cgi, is the name of the CGI script that will handle the form data: often this will be the same as the script that generates the form, and the script will essentially be a large if/else statement:
  my $something_to_look_for = $cgi->param( 'search' );
  if ( $something_to_look_for )
  {
      parse_and_respond_to_form_data( $something_to_look_for );
  }
  else
  {
      generate_html_search_form();
  }
  The method, POST, is the HTTP method you use to send data to the server. If the method is POST, the data will be sent to the server and will appear to your script on STDIN. Use POST to send large quantities of data that should not be bookmarkable, like guestbook submissions. If the method is GET, the data will be sent to the server appended after a ? as part of the URL, as in http://www.steve.gb.com/cgi-bin/search.cgi?search=%24_&boolean=all and will appear to your script as $ENV{QUERY_STRING}. This sort of data can be bookmarked, so it's useful for dictionaries, search scripts, etc.
- Forms can contain several sorts of element, listed below. In each case, the code and the resultant HTML form are shown. Note that all these elements (except submit and reset), will somehow lead to the generation of a name=value pair that will be sent to the script. In a GET form, these will appear in the query_string portion of the URL (script.cgi?name1=value1&name2=value2&name3=value3), as you can see by clicking on the "Echo" button below. Note that if several parameters are sent to the CGI script, they are separated by & characters, that spaces are encoded as + characters, and various special characters like $ and ! are encoded into hexadecimal (%24 and %21 respectively: just try entering these into the text box below).
  - Text input, passwords, submit and reset input elements
    Plain input text <input type="text" name="scrawl" value="perl $!" size="10">
    Password input text <input type="password" name="password" size="10">
    <input type="submit" value="Echo">
    <input type="reset" value="Reset">
    The first element is a text input box, with character width (10) and default value (perl $!). The second element is a password input box, which is identical, but displays stars or blobs rather than what is typed (if you use passwords, you'll want to POST the data, or it'll appear in the query_string!). The last two are special form elements (they do not appear on the echo output, as they submit no data). The submit element is responsible for the submit (echo) button, and the reset element is responsible for the button which resets the form to its pristine condition. The text in these buttons may be modified with their value attributes.
  - Hidden input and option lists
    <input type="hidden" name="invisible" value="nevidebla">
    <select name="species">
    <option value="Dionaea" selected="selected"> Dionaea muscipula</option>
    <option value="Drosera">Drosera capensis</option>
    </select>
    Hidden elements do not show up on the form in the browser, so they can be used to send data to the script that the user need not worry themselves with. Select lists provide a drop down listbox of values; one of these values may be selected by default using "selected" in the attribute list. Note that the data sent to the script (the value) need not necessarily be the same as the data between the <option></option> pair: here selecting "Dionaea muscipula" sends only species=Dionaea to the script. Submit and reset button as before.
  - Radio and check buttons
    Would you like a tutorial?<input type="checkbox" name="tutorial" checked="checked">
    Which language?
    Perl <input type="radio" name="language" value="perl" checked="checked">
    Python <input type="radio" name="language" value="python">
    Checkboxes return a value of "on" if they are checked, and do not appear at all in the form data if they are not checked. They can be checked by default if required using "checked" in the attribute list. Radiobuttons appear in groups with the same name attribute, and one may be defaulted using "checked" too. Submit and reset as before.
  - Textareas
    <textarea rows="2" name="comments" cols="20">Default crap</textarea>
    Textareas are like text input elements, only bigger, and with a weird syntax. If you hadn't noticed, the whole form syntax is a horrible, inconsistent mess. Submit and reset as before.
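The query string encoding described in the list above (& between pairs, + for spaces, %XX for awkward characters) can be sketched in a few lines of plain Perl. The form_encode() and form_decode() subroutines are hypothetical helpers of my own, they assume single-byte characters, and they only roughly mimic what a browser does:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# form_encode() is a hypothetical helper: hex-encode anything that is not a
# letter, digit, underscore, dot, tilde, hyphen or space, then turn spaces
# into + characters.
sub form_encode
{
    my ( $text ) = @_;
    $text =~ s/([^A-Za-z0-9_.~ -])/sprintf( "%%%02X", ord $1 )/eg;
    $text =~ tr/ /+/;
    return $text;
}

# form_decode() undoes form_encode(): + back to space first, then %XX back
# to the character with that hexadecimal code.
sub form_decode
{
    my ( $text ) = @_;
    $text =~ tr/+/ /;
    $text =~ s/%([0-9A-Fa-f]{2})/chr hex $1/eg;
    return $text;
}

my %fields = ( scrawl => 'perl $!', species => 'Dionaea' );
my $query_string = join '&',
    map { form_encode( $_ ) . '=' . form_encode( $fields{ $_ } ) }
    sort keys %fields;

print "$query_string\n";
print form_decode( 'perl+%24%21' ), "\n";
```

Note that the decoding order matters: the + characters have to be converted before the %XX sequences, or a literal plus sign sent as %2B would wrongly end up as a space.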
OK, that was rather more an exhaustive list than a summary, but never mind. When this form data is sent to the server, it can arrive in various ways, which the parse subroutine I outlined above will do a reasonable job of decoding. However, it is far easier to use the CGI module to do this for you. There are two main approaches to this. Whichever you use, you will need to have this at the top of your script somewhere:
#!/usr/bin/perl -T
use strict;
use CGI qw( :cgi );
my $cgi = CGI->new();
CGI is actually capable of being used as an object
oriented module, or as a typical, imperative module. We'll use the OO
syntax here, as I prefer it. Feel free to do the other thing. The two
main approaches to getting data out of the $cgi object are
Vars() and param(). The former returns the form
data as a hash of name => value pairs, the latter returns
a single value given a name. Do whichever seems more appropriate, either
this:
my %formdata = $cgi->Vars();
print $formdata{ scrawl };
or this:
my $scrawl_value = $cgi->param( 'scrawl' );
print $scrawl_value;
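If you want to try both approaches without a web server, the following sketch may help. It assumes the CGI module is installed, and it passes a made-up query string to new() (which CGI allows) instead of reading real form data. It also shows what Vars() does when a name turns up more than once: the values come back packed into one string, separated by "\0" characters:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use CGI qw();

# Passing a query string to new() is handy for trying things out from the
# command line; in a real CGI script you would just call CGI->new().
my $cgi = CGI->new( 'scrawl=perl+%24%21&language=perl&language=python' );

# param() in scalar context returns a single, already decoded, value
my $scrawl = $cgi->param( 'scrawl' );

# Vars() returns every parameter as a hash of name => value pairs
my %formdata = $cgi->Vars();

# repeated parameters arrive as one string with "\0" between the values
my @languages = split /\0/, $formdata{ language };

print "scrawl:    $scrawl\n";
print "languages: @languages\n";
```

This differs from the hand-rolled parse() earlier, which glued repeated values together with ", ", so a value containing a comma could not be told apart from two separate values.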
Another useful method is the header() method. This returns
the "Content-type: text/html" line (and the blank line after it) required
for all CGI scripts generating HTML output on Apache webservers (that's
most perl CGIs), ready for you to print:
print $cgi->header();
In fact, these are the only three methods I ever really use from
CGI! However, it is nice to know that these methods have
been tried and tested by a bazillion perl scripters, and the chances of
them having any shallow bugs left are very small, which I cannot say for
the dodgy parse() subroutine I outlined earlier.
Sometimes you will want to set cookies, or read them, so as to save
the user's configurations (such as whether they would like their pages in
English or Esperanto) between sessions. CGI will do this
too. To create cookies, all you need to do is use:
my $generic_cookie = $cgi->cookie( -name => "cookie_name", -value => "cookie_value" );
my $language_cookie = $cgi->cookie( -name => "language", -value => "Esperanto" );
These return cookie objects that can be used when you print out the HTML header:
print $cgi->header( -cookie => $generic_cookie );
# set one cookie
print $cgi->header( -cookie => [ $generic_cookie, $language_cookie ] );
# or set several cookies (print only one header per response)
They are easily retrieved using:
my $cookie = $cgi->cookie( 'cookie_name' );
my $preferred_language = $cgi->cookie( 'language' );
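Putting the reading and the setting together, a round trip might look something like the sketch below. It assumes the CGI module is installed; the HTTP_COOKIE value, the query string, the fallback to English and the one month expiry are all invented for illustration, and on a real server the environment variable is filled in for you:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use CGI qw();

# Pretend the browser sent back a cookie set on an earlier visit. A real web
# server fills in HTTP_COOKIE for you; we fake it here to run offline.
$ENV{ HTTP_COOKIE } = 'language=Esperanto';

# The query string argument just gives CGI some form data to chew on
my $cgi = CGI->new( 'page=1' );

# read the old cookie, falling back to a default if it is missing
my $preferred_language = $cgi->cookie( 'language' ) || 'English';

# create a fresh cookie object to send back with the response
my $language_cookie = $cgi->cookie(
    -name    => 'language',
    -value   => $preferred_language,
    -expires => '+1M',    # keep it for a month, not just this session
);

my $header = $cgi->header( -cookie => $language_cookie );
print $header;
print "Preferred language: $preferred_language\n";
```

The header that gets printed should include a Set-Cookie line as well as the usual Content-Type one, which is why the cookie has to be handed to header() rather than printed separately.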
That lot should be plenty to get you started. We'll now look at the echo script that the forms above sent their data to:
#!/usr/bin/perl -T
#echo v1.0
use strict;
use warnings;
#use CGI::Carp qw( fatalsToBrowser );
use CGI qw( :cgi );
use HTML::Template;
my $cgi = CGI->new();
my $template = HTML::Template->new( filename => "templates/echo.html" );
$template->param
(
TITLE => "Echo",
KEYWORDS => "echo, Steve's place",
DESCRIPTION => "Echoes back form data parameters",
);
my %formdata = $cgi->Vars();
$template->param( SCRIPT_NAME => $ENV{ SCRIPT_NAME } );
$template->param( REQUEST_METHOD => $ENV{ REQUEST_METHOD } );
$template->param( HTTP_REFERER => $ENV{ HTTP_REFERER } );
$template->param( HTTP_USER_AGENT => $ENV{ HTTP_USER_AGENT } );
my @formdata;
while ( my ( $k, $v ) = each %formdata )
{
push @formdata, { NAME => $k, VALUE => $v };
}
$template->param( FORMDATA => \@formdata );
print $cgi->header();
print $template->output();
exit( 0 );
__END__
The CGI-related parts should be quite clear: the script runs under
warnings, strict and taint, as a secure and well written script should
do. It uses CGI, and gets the form data out into a hash
using the Vars() method. It then prints out the HTTP header.
The rest of the script is there to generate the HTML itself, using
HTML::Template, which is our next port of call.
Templates
HTML::Template is a simple and clean way of generating
HTML dynamically. There are more complex ways (like Mason), and simpler
ways (like heredocs), but HTML::Template seems to tread a
nice path between these extremes, and neatly disconnects the majority of
the code from the design. The module allows three main constructs in the
HTML template: variables, loops and conditionals, which is about as
complex as you can embed into HTML without severely entangling the design
with the technology. Here is the template for the echo script:
<html>
<head>
<title>Steve's place- <TMPL_VAR NAME="TITLE"></title>
<meta name="keywords" content="<TMPL_VAR NAME="KEYWORDS">">
<meta name="description" content="<TMPL_VAR NAME="DESCRIPTION">">
<link rel="stylesheet" type="text/css" href="../style.css">
</head>
<body>
<!--blah, some junk omitted here-->
<h1>Your script parameters were...</h1>
<p>Action (SCRIPT_NAME) = <b><TMPL_VAR NAME="SCRIPT_NAME"></b></p>
<p>Method (REQUEST_METHOD) = <b><TMPL_VAR NAME="REQUEST_METHOD"></b></p>
<TMPL_IF NAME="HTTP_REFERER">
<p>Referrer (HTTP_REFERER) = <b><TMPL_VAR NAME="HTTP_REFERER"></b></p>
<TMPL_ELSE>
<p>Referrer (HTTP_REFERER) = <b>Direct request</b></p>
</TMPL_IF>
<p>Browser (HTTP_USER_AGENT) = <b><TMPL_VAR NAME="HTTP_USER_AGENT"></b></p>
<p>Form data</p>
<ul>
<TMPL_LOOP NAME="FORMDATA">
<li><TMPL_VAR NAME="NAME"> = <b><TMPL_VAR NAME="VALUE"></b></li>
</TMPL_LOOP>
</ul>
<!--blah, some junk omitted here-->
</body>
</html>
The 'gaps' that HTML::Template will fill in are the TMPL_VAR,
TMPL_IF and TMPL_LOOP tags. HTML::Template has three
important methods. The first is new():
my $template = HTML::Template->new( filename => "templates/echo.html" );
This creates a templating object which will fill in the gaps in a file
called templates/echo.html, which is the very thing shown
above. The second important method is param(), which takes a
hash of name => value pairs:
$template->param( template_variable_name => "value to substitute in" );
$template->param( SCRIPT_NAME => $ENV{ SCRIPT_NAME } );
When the template is printed out, any occurrence of the tag:
<TMPL_VAR NAME="SCRIPT_NAME">
in the template will be replaced with the value of $ENV{
SCRIPT_NAME } (e.g. "cgi-bin/echo.cgi"). If
you compare the echo script and the echo template, you will see the
script sets several TMPL_VARs in the same way, such as
TITLE and HTTP_REFERER, and in the template,
you will find the tags <TMPL_VAR NAME="TITLE"> and
<TMPL_VAR NAME="HTTP_REFERER">.
It really is that simple! For simple incorporation of scalar variables into the output, all you need to do is set:
$template->param( BLAH => "FIBBLE" );
in your CGI script, and incorporate the corresponding named
TMPL_VAR tag:
<p><TMPL_VAR NAME="BLAH"></p>
or similar into your HTML template. When you come to use the third
method of HTML::Template, output(), the
template object will generate this:
print $template->output();
<p>FIBBLE</p>
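You can try this out without creating a template file, since new() will also take a scalarref to the template text. The sketch below assumes HTML::Template is installed; the COMMENT parameter and its value are invented for illustration, and show the module's ESCAPE="HTML" attribute, which is worth knowing about whenever the value came from a user:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Template;

# The template lives in a string rather than a file, purely to keep this
# example self-contained
my $template_text = <<'TEMPLATE';
<p><TMPL_VAR NAME="BLAH"></p>
<p><TMPL_VAR ESCAPE="HTML" NAME="COMMENT"></p>
TEMPLATE

my $template = HTML::Template->new( scalarref => \$template_text );

$template->param(
    BLAH    => "FIBBLE",
    COMMENT => '<b>Bold</b> & brash',    # made-up "user input"
);

my $html = $template->output();
print $html;
```

The first paragraph comes out as plain FIBBLE, whereas the angle brackets and ampersand in the second are turned into HTML entities, so they are displayed rather than interpreted by the browser.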
This accounts for about half of the variables in the echo script. However, the module, as I said, also allows for conditionals and loops. To create loops, rather than using a simple hash:
Script:
$template->param( SCRIPT_NAME => $ENV{ SCRIPT_NAME } );
Template:
<TMPL_VAR NAME="SCRIPT_NAME">
you use a reference to an array of hashrefs instead:
Script:
my @formdata;
while ( my ( $k, $v ) = each %formdata )
{
push @formdata, { NAME => $k, VALUE => $v };
# create an array of hashrefs
}
$template->param( FORMDATA => \@formdata );
# give param a reference to this array of hashrefs
Template:
<TMPL_LOOP NAME="FORMDATA">
<li><TMPL_VAR NAME="NAME"> = <b><TMPL_VAR NAME="VALUE"></b></li>
</TMPL_LOOP>
to generate something like this:
<li>language = <b>Esperanto</b></li>
<li>encoding = <b>UTF8</b></li>
<li>...
If you pass the param() method a ( FOO =>
\@array_of_hashrefs ) pair, the module will look for a
corresponding <TMPL_LOOP NAME="FOO"></TMPL_LOOP>
pair in the template. So in this case, we define an arrayref called
FORMDATA, which contains a number of { NAME =>
"language", VALUE => "Esperanto" } hashrefs in the script. When
we send this data to the template, it sets <TMPL_VAR
NAME="NAME"> and <TMPL_VAR NAME="VALUE"> to
each of the corresponding values from the loop variable. This actually
makes it sound more complicated than it really is: if you just read the
code, it makes intuitive sense.
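Here is a self-contained sketch of the loop, again assuming HTML::Template is installed and using a scalarref instead of a template file. The contents of %formdata are made up, and the keys are sorted (unlike in the echo script, where each returns them in no particular order) so that the output is predictable:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Template;

my $template_text = <<'TEMPLATE';
<ul>
<TMPL_LOOP NAME="FORMDATA"><li><TMPL_VAR NAME="NAME"> = <b><TMPL_VAR NAME="VALUE"></b></li>
</TMPL_LOOP></ul>
TEMPLATE

# made-up form data standing in for $cgi->Vars()
my %formdata = ( language => 'Esperanto', encoding => 'UTF8' );

# build the array of hashrefs: one { NAME => ..., VALUE => ... } per row
# (the unary + tells perl the inner braces are a hashref, not a block)
my @rows = map { +{ NAME => $_, VALUE => $formdata{ $_ } } } sort keys %formdata;

my $template = HTML::Template->new( scalarref => \$template_text );
$template->param( FORMDATA => \@rows );

my $html = $template->output();
print $html;
```

Each hashref in the array becomes one pass through the loop body, with its keys available as TMPL_VARs inside the loop.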
To create conditionals is just as easy:
Script:
$template->param( HTTP_REFERER => $ENV{ HTTP_REFERER } );
Template:
<TMPL_IF NAME="HTTP_REFERER">
<p>Referrer (HTTP_REFERER) = <b><TMPL_VAR NAME="HTTP_REFERER"></b></p>
<TMPL_ELSE>
<p>Referrer (HTTP_REFERER) = <b>Direct request</b></p>
</TMPL_IF>
We set a parameter in the template object called
HTTP_REFERER in the script. In the template, if this is
TRUE, then the HTML between the <TMPL_IF
NAME="HTTP_REFERER"></TMPL_IF> will be filled in
appropriately and outputted. You can also (as we have done here), specify
a <TMPL_ELSE> within this structure to be filled in
and outputted if HTTP_REFERER is FALSE. Simple.
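The conditional can be tried out the same way. This sketch assumes HTML::Template is installed; referrer_line() is a hypothetical helper that builds a fresh template object for each call, and the referrer URL is invented:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Template;

my $template_text = <<'TEMPLATE';
<TMPL_IF NAME="HTTP_REFERER">
<p>Referrer = <b><TMPL_VAR NAME="HTTP_REFERER"></b></p>
<TMPL_ELSE>
<p>Referrer = <b>Direct request</b></p>
</TMPL_IF>
TEMPLATE

# referrer_line() is a hypothetical helper: it fills in the template above
# with whatever referrer it is given, which may be undef or empty
sub referrer_line
{
    my ( $referer ) = @_;
    my $template = HTML::Template->new( scalarref => \$template_text );
    $template->param( HTTP_REFERER => $referer );
    return $template->output();
}

print referrer_line( 'http://www.example.com/links.html' );
print referrer_line( undef );
```

An undefined or empty value counts as FALSE in just the way you would expect from Perl itself, so a missing $ENV{ HTTP_REFERER } drops straight through to the TMPL_ELSE branch.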
And that's all there is to it. My search script, guestbook, Madame Perlmina, consensus script, error documents and image embedder all use these two basic modules, and it has been a huge and wonderful relief how much tidier and more maintainable this has made them. So learn from my mistakes, and do it the Easy Way from the start!
