Anyone got some hash? Sorted
Earlier, we covered arrays in
some detail, and learnt the various functions, like push and
pop that you can torture them with. Hashes are our next port
of call: as I said before, they are extremely useful, and they're also
the basis of most perl objects, which
we'll cover soon (just to please all you Java programmers). Perl stores
the pairs of a hash in essentially random order (well, random to you
anyway, perl knows exactly what it's doing!). So operations like
pushing and popping don't make any sense, as
you'll not know what you're getting. We've covered how to get out bits of
a hash:
my %bits = ( soy => 'sauce', sesame => 'oil', garlic => 'clove' );
my $one_item = $bits{ 'soy' };
my @several_items = @bits{ 'sesame', 'garlic' };
To create a new member of the hash, you can't use push,
as it doesn't make any sense, so you need to write:
$hash{ 'new_key' } = 'new_value';
You don't need the quotes around the key when you access or create hash elements, as long as the key is a simple word (letters, digits and underscores):
$hash{new_key} = 'new_value';
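Slices work for putting things in as well as getting them out. Here's a minimal sketch (the extra keys are just made up for illustration) that creates two pairs in one go and then reads three back:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %bits = ( soy => 'sauce' );

# assign to a slice: two new pairs are created in one statement
@bits{ 'sesame', 'garlic' } = ( 'oil', 'clove' );

# read a slice back out; qw saves typing the quotes and commas
my @values = @bits{ qw/soy sesame garlic/ };
print "@values\n";    # sauce oil clove
```

Note that a slice of a hash starts with @, not %, because what you get back is a list.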
If you want to find out if a particular hash entry exists, you can use
the exists function:
print "yep, it's there\n" if exists $bits{soy};
if ( exists $bits{soy} )
{
print "yep it's there\n";
}
Both of these do the same thing, the first just shows you that you can
append if statements in just the same way you can append
foreach. The same
applies to for, while, unless, and
until. If you want to remove a hash key, use
delete.
delete $bits{soy};
will remove the pair ( soy => 'sauce' ) from the hash.
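One thing that may trip you up is the difference between a key existing and its value being defined. The following is only a sketch, with a deliberately undefined value stuck in for garlic, but it shows both that and the fact that delete hands back the value it removed:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %bits = ( soy => 'sauce', sesame => 'oil', garlic => undef );

# exists asks "is there such a key?", defined asks "does it have a value?"
print "garlic exists\n"         if exists $bits{garlic};
print "garlic is not defined\n" unless defined $bits{garlic};

# delete removes the pair and returns the value it held
my $removed = delete $bits{soy};
print "removed: $removed\n";                      # removed: sauce
print "soy is gone\n" unless exists $bits{soy};
```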
These functions are all very useful, but the most common thing you'll
want to do with hashes is iterate over the items in the hash, in much the
same way foreach ( @array ) iterates over the members of an
array. There are no fewer than three variations on this theme. The first
is each, which will return a pair from a hash. You'll most
often see this in constructs using while, like:
#!/usr/bin/perl
use strict;
use warnings;
my %bits = ( soy => 'sauce', sesame => 'oil', garlic => 'clove' );
while ( my ( $key, $value ) = each %bits )
{
print "$key has value $value\n";
}
sesame has value oil
garlic has value clove
soy has value sauce
This iterates over the items in the hash, assigning the key value
pairs to $key and $value in turn. Note the
( ) parentheses around $key and
$value. You need these because each returns a
two-item-long list. Slinging about lists is one of perl's strengths:
my @things = ( 1, 2, 'three' ); # assign a list to an array
my ( $one, $two, $three ) = ( 1, 2, 'three' );
# assign a list of values to a list of variables
($x, $y) = ($y, $x); # swap two scalars
Note the brackets around the ($one, $two, $three). You
need these to make perl realise it's a list, just as when you create
arrays. If you miss them off, perl will treat
$one, $two and $three as three separate expressions, because
assignment binds more tightly than the comma, so only the last one,
$three, is assigned to. That assignment is now a scalar one, so the
right-hand side is evaluated in scalar context and comes up with the
last thing in it, which is 'three'. Perl then goes " $three = 'three' ", and
nothing else. $one and $two will never
be assigned anything. You need brackets to force list context,
in the same way as you sometimes need scalar to force
scalar context. One important thing
to note is that if you put an @array in something like this,
it will be greedy:
#!/usr/bin/perl
use strict;
use warnings;
my ( @greedy, $starving ) =
( 'some', 'other', qw/things using the qw operator/ );
print "\@greedy : @greedy\n\$starving : $starving\n";
@greedy : some other things using the qw operator
$starving :
$starving will never get anything: arrays will slurp up
everything from a list. There are various ways around this: here's just
one (if you know how many items you want to put in the array):
#!/usr/bin/perl
use strict;
use warnings;
my( @greedy, $satiated );
( @greedy[ 0 .. 5 ], $satiated ) =
( 'some', 'other', qw/things using the qw operator/ );
print "\@greedy : @greedy\n\$satiated : $satiated\n";
@greedy : some other things using the qw
$satiated : operator
The slice on the left-hand side:
@greedy[ 0 .. 5 ]
is fairly self-explanatory: it is a slice of the array, using the
.. range operator, so this is just shorthand for:
@greedy[ 0, 1, 2, 3, 4, 5 ]
and will work just fine: the array will only get stuff up to and
including the word 'qw', and $satiated
will get 'operator'. Bear this in mind when you mess with
@_ in subroutines:
( @gets_everything, $gets_nothing ) = @_;
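If all you need is one greedy array, a simple way round this is to put the scalars first and let the array mop up whatever is left. A small sketch, using a made-up subroutine called describe:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical sub: the scalar comes first, the array takes the rest
sub describe
{
    my ( $label, @items ) = @_;
    return "$label: " . scalar( @items ) . " items (@items)";
}

print describe( 'sauces', qw/soy hoisin oyster/ ), "\n";
# sauces: 3 items (soy hoisin oyster)
```

This only helps when there is a single array and it comes last; two arrays would still be flattened into one list.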
Getting round this array flattening and greediness will be covered when we talk about references. So, getting back to hashes:
while ( my ( $key, $value ) = each %bits )
{
print "$key has value $value\n";
}
each generates a two item long list, which is captured
into $key and $value, and this is repeated over
the entire hash using a while loop. Note I've bunged in a
my too, I'll be using strict from now on, in
the interests of getting you into good habits.
The other two ways of torturing a hash are to pull out its
keys or its values, with the relevant keyword.
So:
#!/usr/bin/perl
use strict;
use warnings;
my %trees =
(
acorn => "Quercus",
oak => "Quercus",
beech => "Fagus",
yew => "Taxus",
maidenhair => "Ginkgo",
);
foreach ( keys %trees )
{
print "\%trees contains the Latin name for $_.\n";
}
foreach ( values %trees )
{
print "\%trees knows some English names for $_.\n";
}
%trees contains the Latin name for maidenhair.
%trees contains the Latin name for beech.
%trees contains the Latin name for yew.
%trees contains the Latin name for acorn.
%trees contains the Latin name for oak.
%trees knows some English names for Ginkgo.
%trees knows some English names for Fagus.
%trees knows some English names for Taxus.
%trees knows some English names for Quercus.
%trees knows some English names for Quercus.
I've escaped the % in the double quoted strings: you
don't need to do this, as unlike arrays and scalars, hashes don't
interpolate their contents in a double quoted string. However, it doesn't
hurt, and may be easier for you to remember. Note that hashes can have
several values that are the same (Quercus twice): only
keys have to be unique. If both your keys and your values are
unique, you can make a bilingual dictionary with
reverse...
#!/usr/bin/perl
use strict;
use warnings;
my %Eng_to_Esp =
(
one => 'unu',
two => 'du',
three => 'tri',
four => 'kvar',
five => 'kvin'
);
my %Esp_to_Eng = reverse %Eng_to_Esp;
print "The Esperanto for two is $Eng_to_Esp{two}.\n";
print "And the English for kvar is $Esp_to_Eng{kvar}.\n";
The Esperanto for two is du.
And the English for kvar is four.
You can also see that although a hash itself won't interpolate in a
double quoted string, its members (and items from a normal array) will.
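To see why the values need to be unique before you reverse a hash, here is a quick sketch with a cut-down tree hash in which two keys share the value Quercus. The reversed hash ends up a pair short, and which of the two English names survives depends on the order perl happens to hand the pairs over in:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %trees = ( acorn => 'Quercus', oak => 'Quercus', beech => 'Fagus' );
my %latin_to_english = reverse %trees;

# keys in scalar context gives the number of pairs
my $before = keys %trees;
my $after  = keys %latin_to_english;
print "$before pairs before, $after after\n";    # 3 pairs before, 2 after
```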
Something you'll often want to do is sort a list, especially
with hashes: as the keys, values and
each pairs are essentially in a random order, you'll often
want to torture them into something more ordered. Happily, Perl has a
function called sort for just these occasions:
#!/usr/bin/perl
use strict;
use warnings;
my %trees =
(
oak => "Quercus",
beech => "Fagus",
yew =>"Taxus",
maidenhair => "Ginkgo",
);
print "$_.\n" foreach ( sort keys %trees );
beech.
maidenhair.
oak.
yew.
By default, sort sorts things 'ASCIIbetically':
#!/usr/bin/perl
use strict;
use warnings;
my %trees =
(
Oak => "Quercus", # capital O
beech => "Fagus",
yew =>"Taxus",
maidenhair => "Ginkgo",
);
print "$_.\n" foreach ( sort keys %trees );
Oak.
beech.
maidenhair.
yew.
It sorts strings by the ASCII values of their characters, hence O comes before b, because the ASCIIbet goes something like 0, 1, 2 .. 9, (some other things), A, B, C .. Z, (few bits), a, b, c .. z. As here:
#!/usr/bin/perl
use strict;
use warnings;
print "The ASCII value of O is ", ord "O", "\n";
print "The ASCII value of b is ", ord "b", "\n";
The ASCII value of O is 79
The ASCII value of b is 98
This also demonstrates the use of ord, which tells you
the ASCII value of a letter. chr does the opposite,
converting ASCII numbers to characters.
#!/usr/bin/perl
use strict;
use warnings;
print chr( $_ ) foreach ( 74, 117, 115, 116, 32, 97, 110, 111, 116, 104, 101, 114, 32, 112, 101, 114, 108, 32, 104, 97, 99, 107, 101, 114, 46);
Just another perl hacker.
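The two functions undo each other, which you can check for yourself. In ASCII the lowercase letters sit 32 places after their uppercase partners, so this little sketch (not something you'd do in real life, where lc is the right tool) turns an O into an o the hard way:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $upper = 'O';
my $lower = chr( ord( $upper ) + 32 );    # 79 + 32 = 111, which is 'o'
print "$upper becomes $lower\n";          # O becomes o
```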
Anyway, the point is, if you
want your data sorted numerically, or properly alphabetically, rather
than ASCIIbetically, you'll need to twiddle with sort.
sort can take an optional extra bit that tells it how to
sort:
#!/usr/bin/perl
use strict;
use warnings;
my @numbers = ( 1, 2, 3, 4, 100, 101, 102, 6); # 6 is out of order
my @default_sorted = sort @numbers;
my @numerically_sorted = sort { $a <=> $b } @numbers;
print " DEFAULT: @default_sorted\n NUMERICALLY: @numerically_sorted\n";
DEFAULT: 1 100 101 102 2 3 4 6
NUMERICALLY: 1 2 3 4 6 100 101 102
Note the default output: 100 comes before 2, because the first
character of 100, '1', comes before the first character of 2, '2'. So how
does the numerical sort work? The extra bit sort needs is a
block squashed between the keyword sort and the
things to sort, surrounded by braces { }.
sort { $a <=> $b } @numbers;
The spaceship operator, <=> compares two numbers
and returns certain values depending on which is larger. The values it
compares are $a and $b, which are
sort's default variables, and stand for pairs of things
taken from @numbers. perl does the actual
sorting itself: all you need to tell perl is, given a
pair of numbers ($a and $b), which one is
bigger i.e. should come later in the sorted list?
- If $a is bigger, you need to tell perl '1'.
- If $b is bigger, you need to tell perl '-1'.
- If they are both equal, you should tell perl '0'.
The spaceship operator is a built-in comparison thingummy that does
just this for numbers. For strings, the equivalent is cmp
(remember == vs.
eq), which compares strings character by character
according to their ASCII values. Hence:
sort { $a cmp $b } @strings;
is the same as just plain old:
sort @strings;
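If you'd like to see the 1, -1 and 0 for yourself, you can use the two operators outside a sort. This sketch prints what they return for a few pairs, and then swaps $a and $b round to get a largest-first numerical sort:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# <=> compares as numbers, cmp compares as strings
my @results = ( 2 <=> 10, 10 <=> 2, 2 <=> 2, '2' cmp '10' );
print "@results\n";       # -1 1 0 1

# swapping $a and $b reverses the order of the sort
my @descending = sort { $b <=> $a } ( 1, 100, 6, 2 );
print "@descending\n";    # 100 6 2 1
```

The last comparison is the interesting one: as strings, '2' comes after '10', for exactly the same reason the default sort put 100 before 2.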
To sort things properly alphabetically, you might try:
#!/usr/bin/perl
use strict;
use warnings;
my @trees = qw/oak ash Ginkgo Quercus linden Fraxinus lychee/;
print "$_\n" foreach ( sort { lc( $a ) cmp lc( $b ) } @trees );
ash
Fraxinus
Ginkgo
linden
lychee
oak
Quercus
lc stands for 'lower case': it returns strings it is
given in lowercase, here so they can be compared without worrying that
A-Z comes before a-z in the ASCIIbet. You'll never guess what
uc does.
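The block can do anything it likes with $a and $b, including looking them up in a hash. As a rough sketch, this is one way you might list the keys of a tree hash in order of their values (the Latin names) rather than the keys themselves:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %trees =
(
    oak        => 'Quercus',
    beech      => 'Fagus',
    yew        => 'Taxus',
    maidenhair => 'Ginkgo',
);

# $a and $b are keys, but we compare the values they point to
my @by_latin = sort { $trees{$a} cmp $trees{$b} } keys %trees;
print "$_ ($trees{$_})\n" foreach @by_latin;
# beech (Fagus)
# maidenhair (Ginkgo)
# oak (Quercus)
# yew (Taxus)
```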
You can define much more complicated and arbitrary sorting schemes
than these, using the '1', '-1', '0' thing. In many of these cases, it's
more convenient to define a subroutine to do the comparisons, such as
in_my_arbitrary_way, then call it using:
@weird_sorted = sort in_my_arbitrary_way @things;
Say you'd prefer it if the first word in the dictionary was 'xenon', but then afterwards, carried on as normal:
#!/usr/bin/perl
use strict;
use warnings;
my @strings = qw( zebedee blob aardvark xenon shark cat dog );
my @funny_sorted = sort funny_sort @strings;
print "@funny_sorted\n";
sub funny_sort
{
if ( $a eq 'xenon' )
{
return -1;
# if $a is xenon, $a should come earlier, so -1
}
elsif ( $b eq 'xenon' )
{
return 1;
# if $b is xenon, $a must come later, so 1
}
else
{
return ( lc( $a ) cmp lc ( $b ) );
# otherwise sort alphabetically
}
}
xenon aardvark blob cat dog shark zebedee
This will run under use strict; even though we've not
'scoped' the $a and $b in the subroutine using
my. This is because $a and $b, as
well as all the funny punctuation variables like $_, are
special global variables that strict lets you use without declaring them
(indeed, you cannot scope most of them with my). This is a bit of a wart and due to change in
Perl 6.
Summary
That's sort pretty much
sorted: you can use it in any of these ways:
@sorted = sort @unsorted;
# use the default ASCIIbetical sort
@sorted = sort { DO_SOMETHING_WITH_$a_AND_$b } @unsorted;
# use your own sort
@sorted = sort my_sorting_subroutine @unsorted;
# define your own sort sub elsewhere
As usual in perl, there's more than one way to do things, and there are some clever tricks you can use to speed up sorting, especially if you're sorting on more than one field. We'll leave these more advanced sorting methods until a later lesson.
Hashes are as simple to use as arrays too: you can use any of the following for hash torture:
my %hash =
(
telephone => "Bell",
television => "Baird",
lightbulb => "Edison",
Jesus => "Saul of Tarsus",
);
print $hash{ lightbulb }; # access
print @hash{ 'lightbulb', 'television' }; # slice (a list of keys does need its quotes under strict)
$hash{ www } = "Berners Lee"; # append
print "Yes" if exists $hash{ telephone }; # exist
delete $hash{ Jesus }; # remove
while ( my( $k, $v ) = each %hash )
{
print "$v invented $k\n"; # iterate
}
print keys %hash; # keys
print sort values %hash; # values
Typeglobs and symbol tables
That's pretty much everything for
hashes, except for one topic usually labelled: 'for experts only'. Well,
in the interests of giving you enough rope to hang yourself, and because
it's difficult to find stuff about it, I'm going to tell you a little
about perl's innards. Perl has its own internal hash, called
the Symbol Table, or %main:: (that's 'hash main
double colon'). Mucking about with it really is for experts, but
it's worth introducing you. Try this out:
#!/usr/bin/perl
# use strict; # turn off strictures, for reasons we'll come to in a minute
use warnings;
$pibble = 2;
@foo = ( 1, 4 );
%bits = ( me => 'tired' );
sub my_sort { return ( $a cmp $b ) }
foreach ( sort keys %main:: )
{
print "This perl program has a symbol called $_.\n";
}
This perl program has a symbol called STDIN.
This perl program has a symbol called pibble.
...
This program will print stuff about the 'symbols' perl has defined for
you (like STDIN), and the symbols you have created (like
$pibble, and the name of the subroutine
my_sort). Somewhere you will find pibble,
foo, bits and my_sort. You'll also
find a lot of other things, including STDIN, the name of the
standard input filehandle, and a and b (as in
$a and $b). Hacking on the symbol table is very
powerful, and gives you a taster of what self-manipulating cleverness you
can do with Perl: you can actually use Perl to muck about with how a
program works as the program is running.
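Because the symbol table is a hash, you can poke at it with the ordinary hash functions. Here's a cautious sketch that uses exists to ask whether a symbol is there. The variable is declared with our so that it still runs under strict, and wibble_zzz is just a name I've made up that nothing uses:

```perl
#!/usr/bin/perl
use strict;
use warnings;

our $pibble = 2;    # 'our' declares a package variable, so strict is happy

# %main:: behaves like any other hash as far as exists is concerned
my $has_pibble = exists $main::{pibble}     ? 'yes' : 'no';
my $has_wibble = exists $main::{wibble_zzz} ? 'yes' : 'no';    # made-up name

print "pibble: $has_pibble, wibble_zzz: $has_wibble\n";
# pibble: yes, wibble_zzz: no
```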
If this is boring or confusing you, feel free to go onto the next section, but if you'd like just a bit more, read on. You can always come back to this later.
The symbol table is just a hash, with the rather obscure name
%main:: , and that program just printed out the keys of that
hash. If you want to see the values, you'll have to be
acquainted with Perl's final, and most esoteric data type, the
typeglob, and another type of scoping besides my.
Arrays have @, scalars have $, and
typeglobs have *. In a way, a typeglob
*foo, contains the definitions of
$foo, @foo, %foo, the filehandle
foo, and the subroutine sub foo (which is
called &foo : subs get & as
their sigil) all rolled into one. Try this program out:
#!/usr/bin/perl
# use strict;
# use warnings;
# define some things
$pibble = 2;
@foo = ( 1, 4 );
$foo = 'bar';
%foo = ( key => 'value' );
%bits = ( me => 'tired' );
sub my_sort { return ( $a cmp $b ) }
print "This program contains...\n";
while ( my ( $key, $value ) = each %main:: )
# iterate over the key/value pairs of the symbol table hash
{
local *symbol = $value;
# this assigns the value from the symbol table to a typeglob
# these lines look to see if the typeglob contains
# a $, %, @ or & definition
if ( defined $symbol )
{
print "a scalar called \$$key\n";
# \$$key is just an escaped $
# followed by the contents of variable $key
}
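if ( *symbol{ARRAY} )
# *symbol{ARRAY} is a reference to the array part of the typeglob, if it has one;
# older code wrote 'defined @symbol', which is a fatal error from perl 5.22 on
{
print "an array called \@$key\n";
}
if ( *symbol{HASH} )
# likewise for the hash part (formerly 'defined %symbol')
{
print "a hash called \%$key\n";
}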
if ( defined &symbol )
{
print "a subroutine called $key\n";
}
}
a hash called %ENV
a scalar called $pibble
a scalar called $_
a hash called %UNIVERSAL::
a scalar called $foo
an array called @foo
a hash called %foo
a scalar called $$
...
The values from the symbol table hash are typeglobs, looking something
like *main::foo, *main::ENV,
*main::_ , etc. If you create your own
local typeglob, *symbol, to contain one of
these values from the symbol table, you can look to see if the various
sub-types (scalar, array, etc.) are defined using
$symbol, @symbol,
%symbol and &symbol. So, as the loop runs
through the $key, $value pairs from the symbol
table, $value will at some point contain
*main::foo. So:
local *symbol = $value;
creates a local typeglob *symbol containing the
definitions of symbols called main::foo, and
if ( *symbol{HASH} )
will ask 'is there a hash in the symbol table called
%main::foo?' (older perls let you write this test as defined %symbol, but from perl 5.22 that is a fatal error). (Hope that's clear! It took me a long while to
get my head round this too). The main:: bit means that we're
looking at symbols from the 'main' symbol table. A perl program can use
more than one symbol table: we'll get onto this when we talk about
packages and modules later: the main package
and symbol table is simply the one that perl assumes your program is
using if you don't set it explicitly.
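To get a feel for what 'all rolled into one' means, you can assign a reference to a typeglob, which fills in just the matching slot. The following is only a sketch, with a made-up second name @alias, but it shows that afterwards both names lead to the very same array:

```perl
#!/usr/bin/perl
use strict;
use warnings;

our @foo = ( 1, 4 );
our @alias;             # made-up second name, declared so strict is happy

*alias = \@foo;         # put a reference to @foo in the array slot of *alias
push @alias, 9;         # ...so this really pushes onto @foo

print "@foo\n";         # 1 4 9
```

Only the array slot is touched: a $alias or %alias would be left exactly as it was.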
You probably are bored and confused now, so here's another chance to wuss out and skip to the next section.
Otherwise, we'll cover the last complication. Try sticking a
my on any of the variables you've defined, like
$foo, and run the program. You'll find they suddenly
disappear from the symbol table. What on earth is happening? Well, the
dirty secret is that perl actually has two completely
independent variable sets. Those that you create without a
my (or explicitly create using an our), are perl's old-style global
or package variables, which live in the symbol
table, and are extractable with typeglobs (this includes all subroutine
definitions anywhere, as you can't use my on these as yet).
These variables are global, and any program using your code can access
them. Even if they're defined in a module, like File::Find,
which is a completely separate file, all you need to mess with them is to
know the package to which they belong (here
File::Find), the name of the variable ($dir)
and you can muck about with them happily:
$File::Find::dir = "plopsy";
to probably fatal effect. The reason my variables were
added in Perl 5 was that there was no way to make package variables truly
private to a subroutine or similar. There was no my in Perl
4, and you had to use a thing called local, which you've
seen above with a typeglob, to create temporary dynamically
scoped (as opposed to lexically scoped my)
variables:
#!/usr/bin/perl
use strict;
use warnings;
our $variable = "hello"; # our declares a package variable, which strict insists on
print "\$variable is $variable in the body.\n";
temporary();
print "\$variable is still $variable in the body.\n";
sub temporary
{
local $variable = "goodbye";
print "\$variable is $variable in the temporary sub.\n";
}
$variable is hello in the body.
$variable is goodbye in the temporary sub.
$variable is still hello in the body.
This looks to have exactly the same effect as my would,
but in fact we're still talking about the same
$variable, it just so happens that perl stashes away the
original value when it hits the local, and replaces it when
it returns to the body of the program. The symbol table entry is
temporarily changed to its new value. In contrast, my
creates a completely separate, fresh and unsullied variable with no
relationship whatsoever to variables of the same name elsewhere in the
program. To see the difference, if you called another subroutine from
within temporary(), $variable would still be
set to its temporary value of 'goodbye':
#!/usr/bin/perl
use strict;
use warnings;
our $variable = "hello"; # our declares a package variable, which strict insists on
print "\$variable is $variable in the body.\n";
temporary();
print "\$variable is still $variable in the body.\n";
sub temporary
{
local $variable = "goodbye";
print "\$variable is $variable in the temporary sub.\n";
inner();
}
sub inner
{
print "\$variable is $variable in the inner sub.\n";
}
$variable is hello in the body.
$variable is goodbye in the temporary sub.
$variable is goodbye in the inner sub.
$variable is still hello in the body.
In contrast, 'lexically scoped', my variables
live in only a particular part (scope) of the program, and are completely
inaccessible outside of it. Each new my
$variable is a completely different $variable.
They do not appear in any symbol table (although they will in Perl 6). If you were to
put my instead of local:
#!/usr/bin/perl
use strict;
use warnings;
our $variable = "hello"; # our declares a package variable, which strict insists on
print "\$variable is $variable in the body.\n";
temporary();
print "\$variable is still $variable in the body.\n";
sub temporary
{
my $variable = "goodbye";
print "\$variable is $variable in the temporary sub.\n";
inner();
}
sub inner
{
print "\$variable is $variable in the inner sub.\n";
}
$variable is hello in the body.
$variable is goodbye in the temporary sub.
$variable is hello in the inner sub.
$variable is still hello in the body.
You'll see that the $variable in temporary()
is now a completely different variable, isolated from
the rest of the program, unrelated to the $variable in the
body of the program, and certainly not accessible from
inner() any more. inner() prints out the only
$variable visible in its scope, which is the one in the body
of the program.
So why have we bothered with all this? Well, one of Perl 5's warts is
that certain things can't be scoped with my, including the
global punctuation variables like $_ and $/,
and typeglobs. Although you'll almost never need to, you will sometimes
need local versions of these to prevent you trashing things
in the body of your program, or worse, in other people's programs if you
write modules. Otherwise, steer clear of local!
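As an illustration of the sort of thing local is still good for, here is a sketch using $" (the punctuation variable that says what goes between the items when you interpolate an array in a double quoted string). The subroutine name with_commas is made up; the point is that the change is undone as soon as the sub returns:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @items = qw/soy sesame garlic/;

sub with_commas
{
    local $" = ', ';    # temporary value, restored when the sub returns
    return "@_";
}

my $joined = with_commas( @items );
print "$joined\n";      # soy, sesame, garlic
print "@items\n";       # soy sesame garlic
```

Had the sub simply assigned to $" without the local, every array interpolated later in the program would have grown commas too.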
And I think that is probably plenty enough for the time being! You can always come back later if that made no sense!
Test yourself
See if you can write a script that does the following:
- Sort the numbers 1 to 10 by the length of their English name (longest first), and print them out with the French equivalents in brackets. Do it in essentially two lines of code ☺
#!/usr/bin/perl
use strict;
use warnings;
my %Eng2Fr =
(
one => 'un',
two => 'deux',
three => 'trois',
four => 'quatre',
five => 'cinq',
six => 'six',
seven => 'sept',
eight => 'huit',
nine => 'neuf',
ten => 'dix',
);
print "$_ ( $Eng2Fr{$_} )\n" for
sort { length $b <=> length $a } keys %Eng2Fr;
# Perl est puissant, n'est-ce pas?
# Note that sort{$b<=>$a} is more efficient
# than reverse sort{$a<=>$b}, although maybe not as readable
