Lesson 3

Bondage, discipline and subroutines

You may (although it's unlikely) have noticed a little thing I slipped into the last script: the keyword my in the chomp. my is a very important keyword, although it doesn't seem to make any difference if you delete it and run the program. What my does is pin a variable to a particular part of your program, so that it can't be seen from elsewhere. This may not seem very useful at the moment, but is exceedingly important as your programs get bigger. Such as here:

#!/usr/bin/perl
use strict;
use warnings;
my @peas = qw/chick mushy split/;
while ( my $type = pop @peas )
{
    print "$type peas are ", flavour( $type ), ".\n";
}
sub flavour
{
    my $query = shift @_;
    my @peas = qw/chick garbanzo/;
    foreach ( @peas )
    {
        if ( $query eq $_ )
        {
            return "delicious";
        }
    }
    return "disgusting";
}

Many new things, so we'll take it a bit at a time. Most Perl tutorials I've read leave my until the very end, but it's not really very difficult, and in the interests of getting you into good habits early, we'll take it on now. I've read scripts written for the servers at my university that don't use my, which makes me worry about how well the scripts are coded in other ways. The first way to write well-behaved scripts is to bung this at the top:

use strict;

This turns on perl's bondage and discipline mode. In strict mode, if you do not use my (or its big brother, our) on all variables (and therefore safely pin them down to particular bits of your code), perl will barf. Why should you want bondage and discipline? Why should you want to pin variables down to specific places? Well, on little throwaway scripts, you might not, and it's fine not to bother. But on big things, with lots of user defined functions (subroutines), it's essential. We'll get onto exactly what my does in a little while.

The next part of the code goes:

my @peas = qw/chick mushy split/;

i.e. create an array called @peas containing the obvious items. Note the random choice of quoting characters, and ignore the my for a second. Then:

while ( my $type = pop @peas )
    { print "$type peas are ", flavour( $type ), ".\n"; }

Three new things here, the while loop, the pop and the flavour(). We'll take these in turn. while is another loop control, like for and foreach. It has the general form:

while ( THIS_IS_TRUE ) { DO_SOMETHING; }

So when is:

my $type = pop @peas

TRUE then? Well, perl considers everything TRUE apart from a handful of values: undefined values (undef), the number zero, the string "0" and the empty string "". pop is an array operator, which pulls the last member out of an array and returns it (shortening the array by one). Here the popped member is captured each time into the variable $type. Since "chick", "mushy" and "split" are none of these FALSE values, $type is TRUE until perl tries to pop a non-existent, undefined, fourth item out of the array, whereupon the loop exits. (The flip side is that a pea called "0" or "" would also end the loop early.) Which is all very obvious really:

while ( there are still things to pop out of the array ) { DO_SOMETHING; }

So all this loop does is iterate over the array, just like foreach, but empties the array from the end in so doing. Perl has several other sorts of loop, in addition to while, for and foreach loops. This one should be fairly obvious too:

until ( THIS_IS_TRUE ) { DO_SOMETHING; }

We'll get onto loop control (exiting loops prematurely) later.
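To make the truth test concrete, here is a minimal sketch. The helper is_true is made up for illustration (it is not a Perl built-in), and the pea called "0" is only there to show where the plain while/pop loop would stop early.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# is_true is a made-up helper (not a Perl built-in): it reports
# how perl treats its argument when used as a condition.
sub is_true
{
    my ( $value ) = @_;
    return $value ? "TRUE" : "FALSE";
}

print "'chick' is ", is_true( 'chick' ), "\n"; # TRUE
print "0 is ",       is_true( 0 ),       "\n"; # FALSE: the number zero
print "'0' is ",     is_true( '0' ),     "\n"; # FALSE: the string "0"
print "'' is ",      is_true( '' ),      "\n"; # FALSE: the empty string
print "undef is ",   is_true( undef ),   "\n"; # FALSE: undefined
print "'0.0' is ",   is_true( '0.0' ),   "\n"; # TRUE: not exactly "0"

# A pea called "0" is FALSE, so the plain while/pop loop stops early...
my @peas = qw/chick 0 split/;
my @seen;
while ( my $type = pop @peas ) { push @seen, $type; }

# ...whereas testing with defined() carries on until the array is empty.
my @more_peas = qw/chick 0 split/;
my @all_seen;
while ( defined( my $type = pop @more_peas ) ) { push @all_seen, $type; }

print "plain test saw ( @seen ), defined test saw ( @all_seen )\n";
```

For arrays of ordinary words the plain test is fine; the defined() version only matters if the data might contain "0" or an empty string.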

Perl has plenty of types of loop. It also has plenty of array manipulators. As you now know, pop will pop out the last member of an array. If you want to pull values out of the front end, you'll need shift, which returns the first member of an array, shortening the array by one from the front. If you want to add things to an array, you'll want to use push or unshift, which add things to the end or beginning of an array respectively. For example:

#!/usr/bin/perl
use warnings;
@peas = ( "chick", "mushy", "split" );
print "\@peas contains ( @peas ).\n";
$foo = pop @peas;
# $foo contains "split", @peas now contains ("chick", "mushy")
print "$foo was popped, ( @peas ) are left in \@peas.\n";
$bar = shift @peas;
# $bar contains "chick", @peas now contains just ("mushy")
print "$bar was shifted, ( @peas ) is left in \@peas.\n";
push @peas, "garbanzo";
# @peas now contains ("mushy", "garbanzo")
print "garbanzo was pushed, now \@peas contains ( @peas ).\n";
unshift @peas, "marrowfat";
# @peas now contains ("marrowfat", "mushy", "garbanzo")
print "marrowfat was unshifted, now \@peas contains ( @peas ).\n";
push @peas, $foo, $bar;
# @peas now contains ("marrowfat", "mushy", "garbanzo", "split", "chick")
print "( $foo $bar ) were pushed, now \@peas contains ( @peas ).\n";

@peas contains ( chick mushy split ).
split was popped, ( chick mushy ) are left in @peas.
chick was shifted, ( mushy ) is left in @peas.
garbanzo was pushed, now @peas contains ( mushy garbanzo ).
marrowfat was unshifted, now @peas contains ( marrowfat mushy garbanzo ).
( split chick ) were pushed, now @peas contains ( marrowfat mushy garbanzo split chick ).

push and unshift are list operators, and will add an entire list of things to the array. Bearing in mind an array is just a posh sort of list:

#!/usr/bin/perl
use warnings;
@peas = ( "chick", "mushy", "split" );
@beans = ( "adzuki", "haricot", "mung" );
push @peas, @beans, "and this too";
print "@peas\n";

chick mushy split adzuki haricot mung and this too

will shove the entire contents of @beans onto the end of @peas, followed by the string "and this too".

The least popular array operator is splice. Although splice can do everything pop, push, shift and unshift can do and more, it has a rather difficult syntax.

splice @ARRAY, START_INDEX, THIS_MANY, LIST;

will remove THIS_MANY items starting from START_INDEX, and replace them with the contents of LIST. Incidentally, splice is one of the context sensitive operators: in list context, it will return all the spliced out items, but if you call it in scalar context, it returns just the last item removed from the array, rather than the whole list of them. So:

@all_removed = splice ...;
#list context, because there's an @rray to capture what splice returns
$last_one_removed = splice ...;
#scalar context, because there's only a $calar to capture the output of splice

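THIS_MANY and LIST are optional. Leave out LIST and nothing is put back in place of the removed items; leave out THIS_MANY as well and splice removes everything from START_INDEX to the end of the array.

pop @things;

and

splice( @things, -1, 1 );

mean the same thing: both remove a single item (1), the last (-1) member of an array (@things), and put nothing in its place. (Don't be tempted to write undef as the LIST: that would replace the last member with an undefined value rather than remove it.) pop is more intuitive though. Another useful array operator is reverse: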

@backward_peas = reverse @peas;

reverse leaves @peas itself unchanged, but returns the array in reversed order, here to be captured in @backward_peas. If you want to reverse an array in situ, use:

@array = reverse @array;
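As a rough sketch of how splice relates to the other four operators, the following emulates each of them in turn and then finishes with a reverse. The array contents are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @peas = qw/chick mushy split/;

# like pop @peas: remove from the last item (-1) to the end
my $last = splice( @peas, -1 );
# like shift @peas: remove one item from the front
my $first = splice( @peas, 0, 1 );
# like push @peas, 'garbanzo': remove nothing at the end, insert one item
splice( @peas, scalar @peas, 0, 'garbanzo' );
# like unshift @peas, 'marrowfat': remove nothing at the front, insert one item
splice( @peas, 0, 0, 'marrowfat' );
print "$last and $first were removed, leaving ( @peas )\n";

# Something only splice can do: swap one item in the middle for two others
my @removed = splice( @peas, 1, 1, 'petit-pois', 'yellow-gram' );
print "( @removed ) was replaced, giving ( @peas )\n";

# reverse returns a reversed copy and leaves @peas alone
my @backward_peas = reverse @peas;
print "backwards: ( @backward_peas )\n";
```

Run as written, this should leave @peas holding marrowfat, petit-pois, yellow-gram and garbanzo, in that order.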

Note that some of these operators will only work on arrays, but not on lists. The distinction between an array and a list is similar to that between a scalar and a value: an array is something you can name, like @bits, whereas a list is just a comma-separated list of values in a script. Likewise, $that is a scalar, but 'this' is just a value.

You can slice lists in the same way as you slice arrays:

my @bits = ( 'this', 'is', 'a', 'list', 'not', 'an', 'array' )[ 0 .. 1, 5 .. 6 ];
print "@bits";

However, you cannot pop a list:

my $word = pop ( 'this', 'is', 'a', 'list', 'not', 'an', 'array' );
print $word;

Type of arg 1 to pop must be array (not list).
Execution aborted due to compilation errors.

The reason for this is that although it makes sense that you can slice, or even reverse a list:

print reverse ( qw( t s i l ) );

you cannot remove the last item from a list, because a list is not a variable: to pop a value from the list would be equivalent to taking an eraser to the text of your script, and that is nonsensical.
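If you do want the last value from a list of literal values, a minimal sketch of the usual workarounds is to copy the list into an array first, or, if you only need to read the value, to slice the list. The variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Copy the list into an array: now there is a variable that can be shortened
my @words = ( 'this', 'is', 'a', 'list', 'not', 'an', 'array' );
my $word  = pop @words;
print "$word was popped, ", scalar( @words ), " words are left\n";

# Or just read the last value with a slice: index -1 counts from the end
my $last = ( 'this', 'is', 'a', 'list', 'not', 'an', 'array' )[-1];
print "$last is the last value in the list\n";
```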

Giving something back

Anyway, back to the point. The only other new thing in the code we were examining:

while ( my $type = pop @peas )
    { print "$type peas are ", flavour( $type ), ".\n"; }

is the function flavour(). Although Perl has some bizarrely named operators (like chomp, pop, getgrent and dump), flavour is not amongst them. flavour() is a user defined function, or subroutine, which is the next thing to look at. To create a subroutine you need to write something like:

sub NAME { DO_SOMETHING; }

And to call it, you simply need to write

NAME( ARGUMENT_LIST );

The flavour subroutine is called by the body of the program to determine how the three peas of interest taste. Subroutines frequently need to return things to the main part of the program: in this case, flavour() returns what the subroutine thinks about certain sorts of pea. So let's look at how flavour() does this:

sub flavour
{
    my $query = shift @_;
    my @peas = qw/chick garbanzo/;
    foreach ( @peas )
    {
        if ( $query eq $_ )
        {
            return "delicious";
        }
    }
    return "disgusting";
}

Now, the first new thing here is another of perl's infamous punctuation variables, @_. @_ contains a list of all the arguments passed to the subroutine, in this case, whatever the value of $type was when the subroutine was called in the body of the program. For the sake of argument, let's say this is "chick". @_ is just an array, so shift will pull the first member out as it would with any array. So $query will end up containing "chick". Like $_, @_ is assumed by certain operators: in a subroutine, shift will assume @_ if you don't tell it otherwise, hence:

sub blah { $arg = shift @_; }
sub blah { $arg = shift; }
sub blah { ( $arg ) = @_; }

are more-or-less equivalent. I always use the last one, since it's easier to add extra arguments later. In the last one, we have assigned @_ to a [one item long] list (in parentheses):

( $name, $date, $error, @other_things ) = @_;
( $arg ) = @_;

which allows you to refer to the arguments with pretty names, rather than the perfectly valid, but rather painful:

$_[0];
$_[1];
...

Note that you can't just say:

$arg = @_;

if there's only one argument, since the $arg forces scalar context, as we've seen before, and arrays tell you how big they are, not what's in them in this context. The parentheses are required, unless (of course), you actually want to know how many arguments were passed, rather than what arguments were passed. Which is unlikely.
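The difference between the two kinds of assignment is easy to see in a small sketch. All three subroutine names below are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub how_many
{
    my $count = @_;      # scalar context: the number of arguments
    return $count;
}

sub first_one
{
    my ( $first ) = @_;  # list context: the first argument itself
    return $first;
}

sub name_and_rest
{
    my ( $name, @other_things ) = @_;  # first argument, then all the rest
    return "$name plus " . scalar( @other_things ) . " others";
}

print "how_many:      ", how_many( 'chick', 'mushy', 'split' ), "\n";      # 3
print "first_one:     ", first_one( 'chick', 'mushy', 'split' ), "\n";     # chick
print "name_and_rest: ", name_and_rest( 'chick', 'mushy', 'split' ), "\n"; # chick plus 2 others
```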

The subroutine flavour() defines a list of peas ("chick" and "garbanzo"), called @peas. And this is where my comes in. flavour's @peas has exactly the same name as the @peas in the main body of the program. How is perl supposed to know the difference? What my does is prevent the @peas in the subroutine from trashing the @peas in the main body of the program. Try this out:

#!/usr/bin/perl
use warnings;
@peas = qw/chick mushy/;
    # The body of the program contains an array called @peas
print "In the body of the program, \@peas contains @peas.\n";
trasher();
    # Call the subroutine, no need for arguments
print "Oh dear, it appears that \@peas in the body of the program has been trashed, "
  . "and now contains @peas.\n";
print "This is because \@peas in the subroutine overwrites the \@peas in main.\n";
sub trasher
{
    @peas = qw/petit-pois yellow-gram/;
        # Because we haven't pinned  this @peas down with 'my',
        # it refers to the same array as that in the body of the program
    print "In the subroutine trasher, \@peas contains @peas.\n";
}

In the body of the program, @peas contains chick mushy.
In the subroutine trasher, @peas contains petit-pois yellow-gram.
Oh dear, it appears that @peas in the body of the program has been trashed, and now contains petit-pois yellow-gram.
This is because @peas in the subroutine overwrites the @peas in main.

And note that without the my to pin down the two separate @peas to their proper places, subroutines have free rein to overwrite variables in the body of the program. This is a Bad Thing: subroutines can change the value of variables in the body of the program, but that doesn't mean they should be allowed to! In general, a good subroutine is a black box: you feed it values, and it feeds values back. That way, people can use your subroutines and functions (as they would if you packaged them up into a nice module), without worrying what they might do to the variables in their program, or indeed, what their program might do to yours. Sometimes, you really will want a subroutine to change a 'global' variable, that is one in the body of a program, but more often than not, you don't, and my is the way to stop it, thus:

#!/usr/bin/perl
use warnings;
@peas = qw/chick mushy/;
print "In the body of the program, \@peas contains @peas.\n";
well_behaved( );
print "Using my, we have avoided trashing \@peas in the body of the program\n";
print "\tIt still contains @peas.\n";
sub well_behaved
{
    my @peas = qw/petit-pois yellow-gram/;
    print "In the subroutine well_behaved, \@peas contains its own values, @peas.\n";
}

In the body of the program, @peas contains chick mushy.
In the subroutine well_behaved, @peas contains its own values, petit-pois yellow-gram.
Using my, we have avoided trashing @peas in the body of the program
    It still contains chick mushy.

So what exactly does my do? It stops a variable being visible outside the block in which it is created (declared). Blocks are things enclosed in { } braces:

BODY OF PROGRAM HERE
START OF OUTER BLOCK {
    OUTER BLOCK'S SCOPE EXTENDS FROM HERE
      start of inner block {
      inner block's scope
      } end of inner block
    TO HERE AND INCLUDES THE INNER BLOCK'S SCOPE TOO
} END OF OUTER BLOCK

The 'scope' is basically what is enclosed in a block. If you created a my variable in the inner block, only things in the scope of the inner block could see it. The outer block would not be able to see it (or trash it) at all. If you created a my variable in the outer block, only things in the outer block's scope could see it (but this happens to include the inner block too!). The BODY OF PROGRAM couldn't see either. A subroutine is just a particular case of this:

BODY OF PROGRAM HERE
START OF SUBROUTINE BLOCK {
    SUBROUTINE'S SCOPE EXTENDS FROM HERE
      start of inner block {
      inner block's scope
      } end of inner block
    TO HERE AND INCLUDES THE INNER BLOCK'S SCOPE TOO
} END OF SUBROUTINE BLOCK

So the @peas declared in the subroutine well_behaved() is only visible (and is the first variable of that name that is visible) within the braces that surround the subroutine:

sub well_behaved
{
    my @peas = qw/petit-pois yellow-gram/;
    print "In the subroutine thing, \@peas contains @peas.\n";
}

Outside this 'scope' (the braces of the subroutine), my @peas is invisible, both to the body of the program and to any other subroutines you might create. A my variable is only visible from the place it's created to the end of the innermost enclosing block. There are a few quasi-exceptions to this:

foreach my $pea ( @peas ) { print $pea; }

DWIMs (does what I mean): the $pea belongs to the inner block, the rest of the program can't see it, even though it seems to be declared in the scope of the program, not the foreach block. This is a Good Thing. One thing to be careful of is if you want to use a loop to stuff things into a my variable:

foreach ( @a ) { my @b; push @b, $_; } # WRONG
my @b; foreach ( @a ) { push @b, $_; } # RIGHT

The first one will create a new @b on each pass of the loop, and when the loop exits, @b goes out of scope, so you can't see it anyway! Waste of time. Use the second one. While we're on the subject of foreach loops, you should know that the loop variable stands for the actual variable from the list you're looping over, so mucking with it will muck with the original list:

#!/usr/bin/perl
use warnings;
my @bits = qw/ b c m t /;
print "@bits\n";
foreach my $bit ( @bits ) { $bit .= "ap"; }
print "@bits\n";

b c m t
bap cap map tap

To be very good, and to allow the program to pass with use strict; we must also put my on variables in the body of the program. These will still be visible to subroutines (since the scope of the body includes all its subroutines), and subroutines can still change them, but they will stop use strict; from barfing. It also has some other advantages when we get to playing with modules.
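Here is a small strict-clean sketch of these scoping rules, assuming nothing beyond what has been covered so far. The variable names are made up; the second half shows one way to collect results from a loop without mucking with the original list.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $pea = 'chick';        # declared in the body: visible from here onwards
{
    my $pea = 'mushy';    # a separate variable, visible only inside this block
    print "Inside the block, \$pea is $pea.\n";   # mushy
}
print "Back in the body, \$pea is $pea.\n";       # chick

# Declare the array outside the loop so it survives the loop,
# and copy the loop variable so that @bits itself stays untouched.
my @bits = qw/ b c m t /;
my @words;
foreach my $bit ( @bits )
{
    my $word = $bit;      # a copy: changing it leaves @bits alone
    $word .= "ap";
    push @words, $word;
}
print "@bits\n";          # b c m t
print "@words\n";         # bap cap map tap
```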

The penultimate bit of the program we were originally discussing was this:

    foreach ( @peas )
    {
        if ( $query eq $_ )
        {
            return "delicious";
        }
    }

This part compares the type of pea the subroutine was passed with all the peas in its own @peas, and if it matches any of them, the subroutine returns 'delicious'. Furthermore, you have just met perl's most important conditional statement, if:

if     ( THIS_IS_TRUE ) { DO_SOMETHING; }

which is analogous to:

while  ( THIS_IS_TRUE ) { DO_SOMETHING; }

The equivalent of:

until  ( THIS_IS_TRUE ) { DO_SOMETHING; }

is:

unless ( THIS_IS_TRUE ) { DO_SOMETHING; }

The actual comparison the if statement makes is:

$query eq $_

The eq tests to see if two strings are identical. Perl has two sets of comparisons: numerical and string. The 'equal to' test is eq for strings, and == for numbers (that's two = signs). Perl goof number one is getting == comparison and = assignment mixed up.

In addition to 'equal to' comparisons, Perl also has greater than, less than, greater than or equal to, less than or equal to, and not equal to comparisons. For numbers these are >, <, >=, <=, and != respectively. The equivalents for strings are gt, lt, ge, le, and ne.

The reason Perl makes a distinction between numerical and string comparisons is that "2" and "2.0" are numerically equal, but not stringily equal: "2" == "2.0" is TRUE because 2 and 2.0 are the same numerically (I don't want to hear any mathematicians whining about reals and integers either). However, "2" eq "2.0" is FALSE, because they are clearly not the same string of characters. Just remember you want the maths symbols to compare things as numbers, and the language symbols to compare them as strings.
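A quick sketch of the two sets of comparisons side by side. The helper names same_number and same_string are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub same_number
{
    my ( $x, $y ) = @_;
    return $x == $y ? "TRUE" : "FALSE";   # maths symbol: compare as numbers
}

sub same_string
{
    my ( $x, $y ) = @_;
    return $x eq $y ? "TRUE" : "FALSE";   # language symbol: compare as strings
}

print '"2" == "2.0" is ', same_number( "2", "2.0" ), "\n";   # TRUE
print '"2" eq "2.0" is ', same_string( "2", "2.0" ), "\n";   # FALSE

# The two sets can also disagree about ordering: 10 is the bigger number,
# but as strings "10" comes before "9", because strings are compared
# character by character and "1" comes before "9".
print "10 >  9 is ", ( 10 >  9 ? "TRUE" : "FALSE" ), "\n";   # TRUE
print "10 gt 9 is ", ( 10 gt 9 ? "TRUE" : "FALSE" ), "\n";   # FALSE
```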

if statements can be optionally followed by any number of elsif statements, and an optional else statement, so:

if    ( THIS_IS_TRUE )
{
    DO_THIS_THING;
}
elsif ( THIS_OTHER_THING_IS_TRUE )
{
    DO_THIS_OTHER_THING;
}
else
{
    DO_THE_DEFAULT_THING;
}

Which is all very simple and obvious. You can also nest ifs inside other ifs to a gazillion degrees, which is a perfect way of making code unreadable, but will be necessary from time to time.
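Filling in the placeholders gives something like the following sketch. The subroutine verdict and the opinions it returns are invented for illustration; they are not part of the pea program above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub verdict
{
    my ( $type ) = @_;
    if    ( $type eq 'chick' )
    {
        return "delicious";
    }
    elsif ( $type eq 'mushy' )
    {
        return "tolerable";
    }
    else
    {
        return "disgusting";   # the default, if nothing above matched
    }
}

foreach my $type ( qw/chick mushy split/ )
{
    print "$type peas are ", verdict( $type ), ".\n";
}

# unless runs its block only when the test is FALSE
my $type = 'split';
unless ( verdict( $type ) eq 'delicious' )
{
    print "Best leave the $type peas alone.\n";
}
```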

Anyway, the upshot for the code we're looking at:

    foreach ( @peas )
    {
        if ( $query eq $_ )
        {
            return "delicious";
        }
    }

is that if the type of pea flavour() gets passed matches anything in flavour()'s own @peas, it will return "delicious", using:

return "delicious";

return simply returns the list of things you give it (here the list is just one item long). So if we pass flavour() the value 'chick', which is in flavour()'s list of delicious peas, flavour('chick') will be 'delicious' and this is exactly what is printed out by the body of the program. However, if what we pass doesn't match any of flavour()'s preferences, the foreach loop will end naturally, and we come across:

return "disgusting";

which it duly does.

If you come from a C background, you may be wondering if Perl has a switch statement (if you don't: it's basically a shorthand for a very long if...elsif...elsif...elsif...else statement). Classic Perl doesn't have one. Perl 5.10 did add a given/when construct, but it was later marked experimental, so it's best not to lean on it; Perl 6 (now called Raku) has one built in. For the moment, you'll have to make do with:

for ( $arg )
{
    /^quit$/ && do { exit 0; } ;
    /^help$/ && do { system "perldoc $0" };
}

Which you'll probably not understand until you've covered regexes anyway!
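In the meantime, here is a rough sketch of the same idea using nothing but eq, which needs no regexes. The subroutine command_for and the strings it returns are made up for illustration; each test returns as soon as it matches, so only one branch ever runs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub command_for
{
    my ( $arg ) = @_;
    for ( $arg )                   # for makes $_ stand for $arg in the tests below
    {
        if ( $_ eq 'quit' ) { return "quitting"; }
        if ( $_ eq 'help' ) { return "showing help"; }
    }
    return "unknown command";      # the 'default' case
}

foreach my $word ( qw/help quit dance/ )
{
    print "$word: ", command_for( $word ), "\n";
}
```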

Summary

That's largely all there is to subroutines: create (declare) them with a:

sub blah { DO_SOMETHING; }

use (call) them with a:

blah( LIST_OF_ARGUMENTS );
blah( $calar, @nd_an_array_too, @nd_another_array );
blah(); # if blah doesn't need telling what to do

All the arguments - including any items from arrays passed as arguments - will be flattened into a single long list, which is passed to the subroutine, and available for manipulation within the subroutine inside the default array:

@_

which you can get at using any array operator (or assigning it to a list).

my $arg1 = shift @_;
my $arg2 = pop @_;
my $arg3 = shift; # defaults to @_
my( $arg4, @args5 ) = @_;
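A small sketch of the flattening in action. The subroutine count_and_list is invented for this example; note that once inside it, there is no way to tell where @peas stopped and @beans started.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub count_and_list
{
    # The first argument is the separator; everything else, from however
    # many arrays were passed, lands in @items as one long list.
    my ( $separator, @items ) = @_;
    return scalar( @items ) . " items: " . join( $separator, @items );
}

my @peas  = qw/chick mushy/;
my @beans = qw/adzuki haricot mung/;
print "Got ", count_and_list( ", ", @peas, @beans ), "\n";
# Got 5 items: chick, mushy, adzuki, haricot, mung
```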

Exit the subroutine with:

return ( "something\n", 'and maybe another', $thing, @or_things );
return; # or just exit without returning anything at all

Without an explicit return, a subroutine returns the value of the last thing it evaluated. I always use return as I like to be explicit. You can capture what is returned in the usual way: if blah() takes a list of arguments, and returns just one thing:

$thing_returned_by_blah = blah( $argument, @other_arguments );

or if blah takes no arguments at all but returns a list:

@lot_of_things = blah();

etc., etc.
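The different ways of getting values back out can be sketched as follows. The three subroutines (min_max, double and nothing) are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns a two-item list: the smallest and biggest of its arguments
sub min_max
{
    my @numbers = @_;
    my ( $min, $max ) = ( $numbers[0], $numbers[0] );
    foreach my $number ( @numbers )
    {
        if ( $number < $min ) { $min = $number; }
        if ( $number > $max ) { $max = $number; }
    }
    return ( $min, $max );
}

# No explicit return: the value of the last thing evaluated is returned
sub double
{
    my ( $number ) = @_;
    $number * 2;
}

# A bare return hands back nothing at all
sub nothing
{
    return;
}

my ( $smallest, $biggest ) = min_max( 3, 1, 2 );
my $doubled = double( 21 );
my @nothing = nothing();
print "smallest $smallest, biggest $biggest\n";   # smallest 1, biggest 3
print "double gave $doubled\n";                   # double gave 42
print "nothing gave ", scalar( @nothing ), " things\n";   # nothing gave 0 things
```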

Finally, be warned that:

use strict;
if ( $you_do_not_use eq "my variables" )
{
    my @variables;
    my $pinned_down;
    print "you'll trash variables of the same name in the program body.\n";
    print "and strict will kill you";
}

Test yourself

See if you can write a script that does the following:

#!/usr/bin/perl
use strict;
use warnings;
print "Please enter the names of trees you want to find out about...\n";
while ( my $tree = <STDIN> )
{
    chomp $tree;
    exit if $tree eq 'STOP';
    my $uses = uses( $tree );
    if ( $uses )
    {
        print "The products of $tree include $uses.\n";
    }
    else
    {
        print "I don't have any information about this tree.\n";
    }
}
sub uses
{
    my $tree = shift;
    my %uses =
    (
        oak    => "wood, acorns",
        apple  => "apples, jam, juice",
        orange => "oranges, neroli oil",
        pine   => "pallet boards",
    );
    return $uses{ $tree };
}
