Lesson 4

Anyone got some hash? Sorted

Earlier, we covered arrays in some detail, and learnt the various functions, like push and pop that you can torture them with. Hashes are our next port of call: as I said before, they are extremely useful, and they're also the basis of most perl objects, which we'll cover soon (just to please all you Java programmers). Perl stores the pairs of a hash in essentially random order (well, random to you anyway, perl knows exactly what it's doing!). So operations like pushing and popping don't make any sense, as you'll not know what you're getting. We've covered how to get out bits of a hash:

my %bits = ( soy => 'sauce', sesame => 'oil', garlic => 'clove' );
my $one_item = $bits{ 'soy' };
my @several_items = @bits{ 'sesame', 'garlic' };

To create a new member of the hash, you can't use push, as it doesn't make any sense, so you need to write:

$hash{ 'new_key' } = 'new_value';

You don't need the quotes around the key when you access or create hash elements, as long as the key is a simple word made of letters, digits and underscores:

$hash{new_key} = 'new_value';
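
Quotes only become optional when the key looks like a plain word. As a rough sketch (the %shopping hash and its keys are made up for illustration), keys containing spaces or hyphens still need them:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %shopping;    # hypothetical hash, just for this example

$shopping{milk}           = 'plain word, so no quotes needed';
$shopping{'soy sauce'}    = 'contains a space, so quote it';
$shopping{'semi-skimmed'} = 'contains a hyphen, so quote it';
# Without the quotes, perl would try to read semi-skimmed as an
# expression (semi minus skimmed), and strict would complain.

print $shopping{milk}, "\n";
print $shopping{'soy sauce'}, "\n";
print $shopping{'semi-skimmed'}, "\n";
```

If in doubt, put the quotes in: they never do any harm.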

If you want to find out if a particular hash entry exists, you can use the exists function:

print "yep, it's there\n" if exists $bits{soy};
if ( exists $bits{soy} )
{
    print "yep it's there\n";
}

Both of these do the same thing, the first just shows you that you can append if statements in just the same way you can append foreach. The same applies to for, while, unless, and until. If you want to remove a hash key, use delete.

delete $bits{soy};

will remove the pair ( soy => 'sauce' ) from the hash. These functions are all very useful, but the most common thing you'll want to do with hashes is iterate over the items in the hash, in much the same way foreach ( @array ) iterates over the members of an array. There are no fewer than three variations on this theme. The first is each, which returns a key/value pair from a hash. You'll most often see this in constructs using while, like:

#!/usr/bin/perl
use strict;
use warnings;
my %bits = ( soy => 'sauce', sesame => 'oil', garlic => 'clove' );
while ( my ( $key, $value ) = each %bits )
{
    print "$key has value $value\n";
}
sesame has value oil
garlic has value clove
soy has value sauce

This iterates over the items in the hash, assigning the key value pairs to $key and $value in turn. Note the ( ) parentheses around $key and $value. You need these because each returns a two-item-long list. Slinging about lists is one of perl's strengths:

my @things = ( 1, 2, 'three' ); # assign a list to an array
my ( $one, $two, $three ) = ( 1, 2, 'three' );
    # assign a list of values to a list of variables
($x, $y) = ($y, $x); # swap two scalars

Note the brackets around the ($one, $two, $three). You need these to make perl realise it's a list, just as when you create arrays. If you miss them off, perl sees $one, $two and $three as three separate expressions, because = binds more tightly than the comma, so only the last of them, $three, takes part in the assignment. The bracketed right-hand side is then evaluated in scalar context, where a comma-separated list yields its last item, 'three', so perl effectively goes " $three = 'three' ", and nothing else. $one and $two will never be assigned anything. You need brackets to force list context, in the same way as you sometimes need scalar to force scalar context. One important thing to note is that if you put an @array in something like this, it will be greedy:

#!/usr/bin/perl
use strict;
use warnings;
my ( @greedy, $starving ) =
    ( 'some', 'other', qw/things using the qw operator/ );
print "\@greedy : @greedy\n\$starving : $starving\n";
@greedy : some other things using the qw operator
$starving :

$starving will never get anything: arrays will slurp up everything from a list. There are various ways around this: here's just one (if you know how many items you want to put in the array):

#!/usr/bin/perl
use strict;
use warnings;
my( @greedy, $satiated );
( @greedy[ 0 .. 5 ], $satiated ) =
( 'some', 'other', qw/things using the qw operator/ );
print "\@greedy : @greedy\n\$satiated : $satiated\n";
@greedy : some other things using the qw
$satiated : operator

using a slice assignment:

@greedy[ 0 .. 5 ]

is fairly self-explanatory: it is a slice of the array, using the .. range operator, so this is just shorthand for:

@greedy[ 0, 1, 2, 3, 4, 5 ]

and will work just fine: the array will only get stuff up to and including the word 'qw', and $satiated will get 'operator'. Bear this in mind when you mess with @_ in subroutines:

( @gets_everything, $gets_nothing ) = @_;
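
One simple way to dodge the greediness, if you get to choose the order, is to put the scalars first and the array last, so the array only slurps up what's left. A minimal sketch, with describe() being a made-up subroutine name:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# describe() is a hypothetical helper, named only for this example
sub describe
{
    my ( $label, @items ) = @_;    # scalar first, greedy array last
    return $label . ': ' . join( ' ', @items );
}

print describe( 'sauces', 'soy', 'hoisin', 'oyster' ), "\n";
# prints: sauces: soy hoisin oyster
```

This only helps when there is a single array and it can go at the end of the list.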

Getting round this array flattening and greediness will be covered when we talk about references. So, getting back to hashes:

while ( my ( $key, $value ) = each %bits )
{
    print "$key has value $value\n";
}

each generates a two item long list, which is captured into $key and $value, and this is repeated over the entire hash using a while loop. Note I've bunged in a my too: I'll be using strict from now on, in the interests of getting you into good habits.

The other two ways of torturing a hash are to pull out its keys or its values, with the relevant keyword. So:

#!/usr/bin/perl
use strict;
use warnings;
my %trees =
(
    acorn      => "Quercus",
    oak        => "Quercus",
    beech      => "Fagus",
    yew        => "Taxus",
    maidenhair => "Ginkgo",
);
foreach ( keys %trees )
{
    print "\%trees contains the Latin name for $_.\n";
}
foreach ( values %trees )
{
    print "\%trees knows some English names for $_.\n";
}
%trees contains the Latin name for maidenhair.
%trees contains the Latin name for beech.
%trees contains the Latin name for yew.
%trees contains the Latin name for acorn.
%trees contains the Latin name for oak.
%trees knows some English names for Ginkgo.
%trees knows some English names for Fagus.
%trees knows some English names for Taxus.
%trees knows some English names for Quercus.
%trees knows some English names for Quercus.

I've escaped the % in the double quoted strings: you don't need to do this, as unlike arrays and scalars, hashes don't interpolate their contents in a double quoted string. However, it doesn't hurt, and may be easier for you to remember. Note that hashes can have several values that are the same (Quercus twice): only keys have to be unique. If both your keys and your values are unique, you can make a bilingual dictionary with reverse...

#!/usr/bin/perl
use strict;
use warnings;
my %Eng_to_Esp =
(
    one   => 'unu',
    two   => 'du',
    three => 'tri',
    four  => 'kvar',
    five  => 'kvin'
);
my %Esp_to_Eng = reverse %Eng_to_Esp;
print "The Esperanto for two is $Eng_to_Esp{two}.\n";
print "And the English for kvar is $Esp_to_Eng{kvar}.\n";
The Esperanto for two is du.
And the English for kvar is four.

You can also see that although a hash itself won't interpolate in a double quoted string, its members (and items from a normal array) will. Something you'll often want to do is sort a list, especially with hashes: as the keys, values and each pairs are essentially in a random order, you'll often want to torture them into something more ordered. Perl happily has a function called sort for just these occasions:

#!/usr/bin/perl
use strict;
use warnings;
my %trees =
(
    oak        => "Quercus",
    beech      => "Fagus",
    yew        => "Taxus",
    maidenhair => "Ginkgo",
);
print "$_.\n" foreach ( sort keys %trees );
beech.
maidenhair.
oak.
yew.

By default, sort sorts things 'ASCIIbetically':

#!/usr/bin/perl
use strict;
use warnings;
my %trees =
(
    Oak        => "Quercus", # capital O
    beech      => "Fagus",
    yew        => "Taxus",
    maidenhair => "Ginkgo",
);
print "$_.\n" foreach ( sort keys %trees );
Oak.
beech.
maidenhair.
yew.

It sorts strings by the ASCII values of their characters, hence O comes before b, because the ASCIIbet goes something like 0, 1, 2 .. 9, (some other things), A, B, C .. Z, (few bits), a, b, c .. z. As here:

#!/usr/bin/perl
use strict;
use warnings;
print "The ASCII value of O is ", ord "O", "\n";
print "The ASCII value of b is ", ord "b", "\n";
The ASCII value of O is 79
The ASCII value of b is 98

This also demonstrates the use of ord, which tells you the ASCII value of a letter. chr does the opposite, converting ASCII numbers to characters.

#!/usr/bin/perl
use strict;
use warnings;
print chr( $_ ) foreach ( 74, 117, 115, 116, 32,
97, 110, 111, 116, 104, 101, 114, 32, 112, 101,
114, 108, 32, 104, 97, 99, 107, 101, 114, 46);
Just another perl hacker.

Anyway, the point is, if you want your data sorted numerically, or properly alphabetically, rather than ASCIIbetically, you'll need to twiddle with sort. sort can take an optional extra bit that tells it how to sort:

#!/usr/bin/perl
use strict;
use warnings;
my @numbers = ( 1, 2, 3, 4, 100, 101, 102, 6); # 6 is out of order
my @default_sorted = sort @numbers;
my @numerically_sorted = sort { $a <=> $b } @numbers;
print " DEFAULT: @default_sorted\n NUMERICALLY: @numerically_sorted\n";
DEFAULT: 1 100 101 102 2 3 4 6
NUMERICALLY: 1 2 3 4 6 100 101 102

Note the default output: 100 comes before 2, because the first character of 100, '1', comes before the first character of 2, '2'. So how does the numerical sort work? The extra bit sort needs is a block squashed between the keyword sort and the things to sort, surrounded by braces { }.

sort { $a <=> $b } @numbers;

The spaceship operator, <=>, compares two numbers and returns -1 if the first is smaller, 0 if they are equal, and 1 if the first is larger. The values it compares are $a and $b, which are sort's default variables, and stand for pairs of things taken from @numbers. perl does the actual sorting itself: all you need to tell perl is, given a pair of numbers ($a and $b), which one is bigger, i.e. should come later in the sorted list?

The spaceship operator is a built-in comparison thingummy that does just this for numbers. For strings, the equivalent is cmp (remember == vs. eq), which compares strings character by character according to their ASCII values. Hence:

sort { $a cmp $b } @strings;

is the same as just plain old:

sort @strings;
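
To make the return values concrete, here is a small sketch that just prints what the two comparison operators hand back (the array names @numeric and @stringy are invented for the example):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# each comparison returns -1, 0 or 1
my @numeric = ( 2 <=> 10, 10 <=> 10, 10 <=> 2 );
my @stringy = ( '2' cmp '10', 'ash' cmp 'oak', 'oak' cmp 'oak' );

print "numeric comparisons: @numeric\n";    # -1 0 1
print "string comparisons:  @stringy\n";    # 1 -1 0
```

Note that 2 <=> 10 gives -1 but '2' cmp '10' gives 1: compared as strings, the character '2' comes after the character '1', which is exactly why the default sort put 100 before 2.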

To sort things properly alphabetically, you might try:

#!/usr/bin/perl
use strict;
use warnings;
my @trees = qw/oak ash Ginkgo Quercus linden Fraxinus lychee/;
print "$_\n" foreach ( sort { lc( $a ) cmp lc( $b ) } @trees );
ash
Fraxinus
Ginkgo
linden
lychee
oak
Quercus

lc stands for 'lower case': it returns strings it is given in lowercase, here so they can be compared without worrying that A-Z comes before a-z in the ASCIIbet. You'll never guess what uc does.
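
The comparison block can look things up as well as compare them directly, which is handy with hashes. Here's a sketch that sorts a hash's keys by their values rather than by the keys themselves (the %trees data and the @by_latin name are just for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %trees =
(
    ash   => "Fraxinus",
    oak   => "Quercus",
    beech => "Fagus",
    lime  => "Tilia",
    yew   => "Taxus",
);

# $a and $b are keys; compare the values they point to instead
my @by_latin = sort { lc( $trees{$a} ) cmp lc( $trees{$b} ) } keys %trees;

print "$_ ($trees{$_})\n" foreach ( @by_latin );
# beech (Fagus)
# ash (Fraxinus)
# oak (Quercus)
# yew (Taxus)
# lime (Tilia)
```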

You can define much more complicated and arbitrary sorting schemes than these by returning -1, 0 or 1 yourself: -1 if $a should come first, 1 if $b should come first, and 0 if you don't care. In many of these cases, it's more convenient to define a subroutine to do the comparisons, such as in_my_arbitrary_way, then call it using:

@weird_sorted = sort in_my_arbitrary_way @things;

Say you'd prefer it if the first word in the dictionary was 'xenon', but then afterwards, carried on as normally:

#!/usr/bin/perl
use strict;
use warnings;
my @strings = qw( zebedee blob aardvark xenon shark cat dog );
my @funny_sorted = sort funny_sort @strings;
print "@funny_sorted\n";
sub funny_sort
{
    if    ( $a eq 'xenon' )
    {
        return -1; 
            # if $a is xenon, $a should come earlier, so -1
    }
    elsif ( $b eq 'xenon' )
    {
        return 1; 
            # if $b is xenon, $a must come later, so 1
    }
    else
    {
        return ( lc( $a ) cmp lc ( $b ) ); 
            # otherwise sort alphabetically
    }
}
xenon aardvark blob cat dog shark zebedee

This will run under use strict; even though we've not declared $a and $b in the subroutine using my. This is because $a and $b, like the funny punctuation variables such as $_, are special global variables that strict exempts from needing a declaration (indeed, you cannot scope most of them with my, and declaring my $a or my $b would stop sort from seeing them). This is a bit of a wart and due to change in Perl 6.

Summary

That's sort pretty much sorted: you can use it in any of these ways:

@sorted = sort @unsorted;
    # use the default ASCIIbetical sort
@sorted = sort { DO_SOMETHING_WITH_$a_AND_$b } @unsorted;
    # use your own sort
@sorted = sort my_sorting_subroutine @unsorted;
    # define your own sort sub elsewhere

As usual in perl, there's more than one way to do things, and there are some clever tricks you can use to speed up sorting, especially if you're sorting on more than one field. We'll leave these more advanced sorting methods until a later lesson.

Hashes are as simple to use as arrays too: you can use any of the following for hash torture:

my %hash =
(
    telephone   => "Bell",
    television  => "Baird",
    lightbulb   => "Edison",
    Jesus       => "Saul of Tarsus",
);
print $hash{ lightbulb };                 # access
print @hash{ 'lightbulb', 'television' }; # slice (keys in a slice need quotes)
$hash{ www } = "Berners-Lee";             # add
print "Yes" if exists $hash{ telephone }; # exist
delete $hash{ Jesus };                    # remove
while ( my( $k, $v ) = each %hash )
{
    print "$v invented $k\n";             # iterate
}
print keys %hash;                         # keys
print sort values %hash;                  # values
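
Putting the two halves of this lesson together, a common idiom is to walk a hash in sorted key order. A small sketch, assuming a cut-down %inventors hash and an @lines array invented for the example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %inventors =
(
    telephone  => "Bell",
    television => "Baird",
    lightbulb  => "Edison",
);

my @lines;
foreach my $thing ( sort keys %inventors )
{
    # keys come out in ASCIIbetical order, values are looked up as we go
    push @lines, "$inventors{$thing} invented the $thing";
}
print "$_\n" foreach ( @lines );
# Edison invented the lightbulb
# Bell invented the telephone
# Baird invented the television
```

Unlike each, this gives the same order every time you run it.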

Typeglobs and symbol tables

That's pretty much everything for hashes, except for one topic usually labelled: 'for experts only'. Well, in the interests of giving you enough rope to hang yourself, and because it's difficult to find stuff about it, I'm going to tell you a little about perl's innards. Perl has its own internal hash, called the Symbol Table, or %main:: (that's 'hash main double colon'). Mucking about with it really is for experts, but it's worth introducing you. Try this out:

#!/usr/bin/perl
# use strict; # turn off strictures, for reasons we'll come to in a minute
use warnings;
$pibble = 2;
@foo = ( 1, 4 );
%bits = ( me => 'tired' );
sub my_sort { return ( $a cmp $b ) }
foreach ( sort keys %main:: )
{
    print "This perl program has a symbol called $_.\n";
}
This perl program has a symbol called STDIN.
This perl program has a symbol called pibble.
...

This program will print stuff about the 'symbols' perl has defined for you (like STDIN), and the symbols you have created (like $pibble, and the name of the subroutine my_sort). Somewhere you will find pibble, foo, bits and my_sort. You'll also find a lot of other things, including STDIN, the name of the standard input filehandle, and a and b (as in $a and $b). Hacking on the symbol table is very powerful, and gives you a taster of what self-manipulating cleverness you can do with Perl: you can actually use Perl to muck about with how a program works as the program is running.

If this is boring or confusing you, feel free to go onto the next section, but if you'd like just a bit more, read on. You can always come back to this later.

The symbol table is just a hash, with the rather obscure name %main:: , and that program just printed out the keys of that hash. If you want to see the values, you'll have to be acquainted with Perl's final, and most esoteric data type, the typeglob, and another type of scoping besides my. Arrays have @, scalars have $, and typeglobs have *. In a way, a typeglob *foo, contains the definitions of $foo, @foo, %foo, the filehandle foo, and the subroutine sub foo (which is called &foo : subs get & as their sigil) all rolled into one. Try this program out:

#!/usr/bin/perl
# use strict;
# use warnings;
# define some things
$pibble = 2;
@foo = ( 1, 4 );
$foo = 'bar';
%foo = ( key => 'value' );
%bits = ( me => 'tired' );
sub my_sort { return ( $a cmp $b ) }
print "This program contains...\n";
while ( my ( $key, $value ) = each %main:: )
# iterate over the key/value pairs of the symbol table hash
{
    local *symbol = $value;
    # this assigns the value from the symbol table to a typeglob
    # these lines look to see if the typeglob contains 
    # a $, %, @ or & definition
    if ( defined $symbol )
    {
        print "a scalar called \$$key\n";
        # \$$key is just an escaped $
        # followed by the contents of variable $key
    }
    if ( @symbol )
    {
        # newer perls refuse to run defined @array,
        # so we check that the array has something in it instead
        print "an array called \@$key\n";
    }
    if ( %symbol )
    {
        # likewise for defined %hash
        print "a hash called \%$key\n";
    }
    if ( defined &symbol )
    {
        print "a subroutine called $key\n";
    }
}
a hash called %ENV
a scalar called $pibble
a scalar called $_
a hash called %UNIVERSAL::
a scalar called $foo
an array called @foo
a hash called %foo
a scalar called $$
...

The values from the symbol table hash are typeglobs, looking something like *main::foo, *main::ENV, *main::_ , etc. If you create your own local typeglob, *symbol, to contain one of these values from the symbol table, you can look to see if the various sub-types (scalar, array, etc.) are defined (or, for arrays and hashes, have anything in them) using $symbol, @symbol, %symbol and &symbol. So, as the loop runs through the $key, $value pairs from the symbol table, $value will at some point contain *main::foo. So:

local *symbol = $value;

creates a [local] typeglob *symbol containing the definitions of symbols called main::foo, and

if ( %symbol )

will ask 'is there a hash with anything in it in the symbol table called %main::foo?'. (Hope that's clear! It took me a long while to get my head round this too). The main:: bit means that we're looking at symbols from the 'main' symbol table. A perl program can use more than one symbol table: we'll get onto this when we talk about packages and modules later. The main package and symbol table is simply the one that perl assumes your program is using if you don't set it explicitly.
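
If you'd like to see a single symbol table entry without the loop, you can look one up by name, just like any other hash element. A minimal sketch that also runs under strict: it uses our (explained a little further on) to declare a package variable, and $glob is a name invented for the example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

our $foo = 'bar';           # a package variable, so it lives in %main::

my $glob = $main::{foo};    # fetch the symbol table entry called foo

print "The symbol table entry is $glob\n";      # *main::foo
print "and its scalar part holds $main::foo\n"; # bar
```

The key is plain foo, with no sigil: the one entry covers $foo, @foo, %foo and the rest.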

You probably are bored and confused now, so here's another chance to wuss out and skip to the next section.

Otherwise, we'll cover the last complication. Try sticking a my on any of the variables you've defined, like $foo, and run the program. You'll find they suddenly disappear from the symbol table. What on earth is happening? Well, the dirty secret is that perl actually has two completely independent variable sets. Those that you create without a my (or explicitly create using an our), are perl's old-style global or package variables, which live in the symbol table, and are extractable with typeglobs (this includes all subroutine definitions anywhere, as you can't use my on these as yet). These variables are global, and any program using your code can access them. Even if they're defined in a module, like File::Find, which is a completely separate file, all you need to mess with them is to know the package to which they belong (here File::Find), the name of the variable ($dir) and you can muck about with them happily:

$File::Find::dir = "plopsy";

to probably fatal effect. The reason these package variables were supplemented with my variables in Perl 5 is that there was no way to make them truly private to a subroutine or similar. There was no my in Perl 4, and you had to use a thing called local, which you've seen above with a typeglob, to create temporary, dynamically scoped (as opposed to lexically scoped my) variables:

#!/usr/bin/perl
use strict;
use warnings;
our $variable = "hello"; # strict insists package variables are declared with our
print "\$variable is $variable in the body.\n";
temporary();
print "\$variable is still $variable in the body.\n";
sub temporary
{
    local $variable = "goodbye";
    print "\$variable is $variable in the temporary sub.\n";
}
$variable is hello in the body.
$variable is goodbye in the temporary sub.
$variable is still hello in the body.

This looks to have exactly the same effect as my would, but in fact we're still talking about the same $variable, it just so happens that perl stashes away the original value when it hits the local, and replaces it when it returns to the body of the program. The symbol table entry is temporarily changed to its new value. In contrast, my creates a completely separate, fresh and unsullied variable with no relationship whatsoever to variables of the same name elsewhere in the program. To see the difference, if you called another subroutine from within temporary(), $variable would still be set to its temporary value of 'goodbye':

#!/usr/bin/perl
use strict;
use warnings;
our $variable = "hello"; # strict insists package variables are declared with our
print "\$variable is $variable in the body.\n";
temporary();
print "\$variable is still $variable in the body.\n";
sub temporary
{
    local $variable = "goodbye";
    print "\$variable is $variable in the temporary sub.\n";
    inner();
}
sub inner
{
    print "\$variable is $variable in the inner sub.\n";
}
$variable is hello in the body.
$variable is goodbye in the temporary sub.
$variable is goodbye in the inner sub.
$variable is still hello in the body.

In contrast, 'lexically scoped', my variables live in only a particular part (scope) of the program, and are completely inaccessible outside of it. Each new my $variable is a completely different $variable. They do not appear in any symbol table (although they will in Perl 6). If you were to put my instead of local:

#!/usr/bin/perl
use strict;
use warnings;
our $variable = "hello"; # strict insists package variables are declared with our
print "\$variable is $variable in the body.\n";
temporary();
print "\$variable is still $variable in the body.\n";
sub temporary
{
    my $variable = "goodbye";
    print "\$variable is $variable in the temporary sub.\n";
    inner();
}
sub inner
{
    print "\$variable is $variable in the inner sub.\n";
}
$variable is hello in the body.
$variable is goodbye in the temporary sub.
$variable is hello in the inner sub.
$variable is still hello in the body.

You'll see that the $variable in temporary() is now a completely different variable, isolated from the rest of the program, unrelated to the $variable in the body of the program, and certainly not accessible from inner() any more. inner() prints out the only $variable visible in its scope, which is the one in the body of the program.

So why have we bothered with all this? Well, one of Perl 5's warts is that certain things can't be scoped with my, including the global punctuation variables like $_ and $/, and typeglobs. You won't need to very often, but you will occasionally need local versions of these to prevent you trashing things in the body of your program, or worse, in other people's programs if you write modules. Otherwise, steer clear of local!
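
As a small illustration of that last point, here's a sketch using $" (the separator perl puts between array items when you interpolate an array in a string). The subroutine name comma_list is made up for the example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @trees = qw( oak ash yew );

# comma_list() is a hypothetical helper, named only for this example
sub comma_list
{
    local $" = ', ';    # temporary value, put back when the sub returns
    return "@_";
}

print comma_list( @trees ), "\n";    # oak, ash, yew
print "@trees\n";                    # oak ash yew
```

Because the change is made with local, the rest of the program (and anyone else's code) still sees the usual single space afterwards.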

And I think that is probably plenty enough for the time being! You can always come back later if that made no sense!

Test yourself

See if you can write a script that does the following:

#!/usr/bin/perl
use strict;
use warnings;
my %Eng2Fr =
(
    one   => 'un',
    two   => 'deux',
    three => 'trois',
    four  => 'quatre',
    five  => 'cinq',
    six   => 'six',
    seven => 'sept',
    eight => 'huit',
    nine  => 'neuf',
    ten   => 'dix',
);
print "$_ ( $Eng2Fr{$_} )\n" for 
    sort { length $b <=> length $a } keys %Eng2Fr;
# Perl est puissant, n'est-ce pas? (Perl is powerful, isn't it?)
# Note that sort{$b<=>$a} is more efficient 
# than reverse sort{$a<=>$b}, although maybe not as readable
