Tied up in knots
Sometimes you'll want to make a data structure more persistent than the runtime of a perl program, so you can write and read the same (or updated) data between successive runs of the script. Of course, there is more than one way to do this. Say you have a hash whose keys and values you would like to save persistently: maybe these are configuration options, or maybe they are a dictionary of Latin to English plant names. You could:
- Write it out to an external file in an external format, such as plain text or tag soup:
#!/usr/bin/perl
use strict;
my %names = ( "Drosera capensis" => "Cape sundew" );
open my $LATINNAMES, '>', shift or die $!;
while ( my ( $k, $v ) = each %names )
{
print $LATINNAMES
"<item>\n\t<latin>$k</latin>\t<english>$v</english>\n</item>";
}
- Write it out to an external file in an internal format, using
Data::Dumperor similar.
#!/usr/bin/perl use strict; my %names = ( "Drosera capensis" => "Cape sundew" ); open my $LATINNAMES, '>', shift or die $!; print $LATINNAMES Dumper( \%names );
However, both of these techniques suffer from the drawback that your script has to deal with loading and parsing the file at startup, and with re-dumping the relevant data at shutdown. You have to manually couple the data and the external file, which may be a bit of a drama. Furthermore, the external file (and I will bet here and now you won't write it in the probably soon to be lingua franca of everywhere, i.e. valid XML), will be your own proprietary format, that no other script will be able to use without copying the chunks of your script that deal with parsing and dumping. So you (sensibly) put the parse and dump routines in a module, so that more than one script can access the same data. If you're paranoid, you make sure you also lock these external files to prevent anyone else messing with them while you're using them:
#!/usr/bin/perl use strict; use Fcntl qw( :flock :seek ); open my $LATINNAMES, '>', "C:/dump.txt"; flock $LATINNAMES, LOCK_EX; # exclusive lock seek $LATINNAMES, 0, SEEK_END; # seek to the end of the file print $LATINNAMES "Drosera capensis :: Cape sundew\n"; flock $LATINNAMES, LOCK_UN; # unlock
The functions flock and seek are a tad
complex; flock takes two arguments, a filehandle to
(un)lock, and a number (1, 2, 4 or 8) that describes how you want to lock
the file (exclusive lock, shared lock, unlock, etc.). Since
remembering that 8 = unlock is a pain, the Fcntl module will
export four constant subroutines via the export tag
:flock that you can use instead: LOCK_UN is
just a funny looking subroutine call to something like:
sub LOCK_UN { return 8 }
seek is even more horrible, taking three arguments. It
moves the 'read/write head' in the filehandle to a particular position
(here, so that if someone has buggered about with the file between
opening and locking it, we can move the head back to where we want it).
The first argument to seek is a filehandle, the second is
the number of bytes from a particular offset in the file to move to, and
the third is that offset. Here the offset is SEEK_END
(numerically '2'), which is the end of the file, and we seek 0 bytes on
from it. Hence seek( FILE, 0, SEEK_END ) moves the
read/write head right to the end of the file. Bit convoluted, but there
you have it.
Of course, when you come to write your next quick hack, which you get
to dump data in an entirely different format, you have to write another
module, and so on, till you are crushed under the weight of your
ParseDump/*.pm modules, file locking utilities and a great
clumping mass of half-baked XML, plain text config file files and frozen,
Storable-d and Data::Dumper-ed variables.
Feeling ill yet? There is an easier way. How about dumping your data in a
format that everyone can interrogate in a standard way, from anywhere in
the world with a well known language? Enter the world of the relational
database:
Relational databases are a way of storing data, which may have complex relationships to other pieces of data. For example, we may want to store the details (customer name, invoice address) of all our customers in one 'table' (like a spreadsheet). We may also want to store a record (date, customer name, service) of every transaction we make with our many customers in another table. By using a language called SQL (simple query language), we can interrogate the database to get the names and addresses of all customers who have had a transaction with us in the last six months.
Relational databases (like XML) are used
everywhere, so sooner or later you will end up storing or accessing data
via one of them. They're also a very good place to store your
own persistent data. Fortunately, there is a freeware (at least for the
little people) database called MySQL
(there are others too), that'll run on Unix or Windows. To play with it,
you'll need to install it, and also install two perl modules,
DBI and DBD::mysql. The Database Interface
(DBI) module provides various standard methods (an API) that
allow you to interact with databases, and the Database Driver
(DBD) module provides the glue that sticks the interface to
a particular database: there's also a DBD::Oracle,
DBD::msql, etc. These modules allow you to play
with SQL from within Perl scripts using an abstracted, common API
i.e. you can use any database, Oracle, PostGreSQL,
etc., with no change in syntax, at least in the Perl parts, so
long as you tell perl which DBD driver module to use. This
is not the
place to learn SQL and database syntax. However, the basic concepts
are:
- A relational database interface such as MySQL will generally run as a server, and it can be interrogated by a client.
- Your perl script (via
DBIandDBD::mysql) acts as this client. - A database server can serve data from several databases.
- To create a database, we use the SQL syntax
CREATE DATABASE plantsfrom the command line MySQL shell. - Each database consists of one or more tables. Each table is just a
spreadsheet of some sort. Each row is a called a record, and contains
one or more items of data, corresponding to the columns of the table.
You could think of it as an array of hashrefs. There should be at least
one column (
latin,english, etc.) that is guaranteed to have a unique value for every record in the table (here it'slatin). Ths is called the primary key, and is usually ensured by creating an autoincrementing field in MySQL.@latin_names = ( { latin => "Drosera capensis ", english => "Cape sundew" }, { latin => "Dionaea muscipula ", english => "Venus flytrap" }, ); - To create a table, we use the SQL syntax:
CREATE TABLE latinnames (latin VARCHAR(100) NOT NULL PRIMARY KEY, english VARCHAR(100))This creates a table calledlatinnameswith two columns, both variable width character fields (max length 100 characters) calledlatinandenglish.The first is the primary key, and as such, must have a value (NOT NULL), although frequently, the best option is to create anAUTOINCREMENTINTeger field for this to ensure uniqueness. However, Latin names of plants are unique, so this will do as the primary key. - To create a record, we use:
INSERT INTO latinnames ( latin, english ) VALUES ( 'Drosera aliciae', 'Alice sundew' ). - To update a record we use:
UPDATE latinnames SET english='Albany pitcher plant' WHERE latin='Cephalotus follicularis'. - To delete a record we use:
DELETE FROM latinnames WHERE latin='Nepenthes rajah'. - To select data we use:
SELECT latin, english FROM latinnames WHERE latin='Drosera aliciae';If you use the character*in theSELECT, it will pull out all the fields. You can also use things likeWHERE latin REGEXP 'Drosera'to pull out several records that contain the word'Drosera', and so on. - To remove a table, we use
DROP latinnames.
There's plenty more syntax you can learn elsewhere, but this will be
enought to create something useful. To use SQL from within perl, we need
to connect to a database, prepare and execute an SQL statement, read the
result(s), and (eventually) disconnect from the database. In
DBI, the general scheme is therefore:
#!/usr/bin/perl
use strict;
use DBI;
my ( $username, $password ) = ( "USERNAME", "PASSWORD" );
my $dsn = "DBI:mysql:database=plants:host=localhost:port=3306";
my $dbh = DBI->connect( $dsn, $username, $password );
my $query_string = "SELECT * FROM latinnames WHERE latin='Nepenthes rajah'";
my $sth = $dbh->prepare( $query_string );
$sth->execute();
while ( my @results = $sth->fetchrow_array() )
{
print "$_\n" for @results;
}
$dbh->disconnect();
To interact with a database, we need to use DBI; The
first DBI method we need to deal with is the one that allows
you to connect to a database server. Helpfully enough this is a (class)
method called connect(), which serves the purpose of
new() in most OO schemes:
my $dbh = DBI->connect( $dsn, $username, $password);
connect returns a database handle (conventionally called
$dbh). The data source name (often written
$dsn) is a colon padded string that tells perl which
database to connect to and how, and it should look like:
DBI : dbd_driver_name : database=database_name :
host = host_name : port = port_number
For example:
DBI:mysql:database=plants:host=localhost:port=3306
This allows us to connect to the MySQL database plants,
running on the local machine on port 3306. Of course, it won't connect
unless you've actually created a database called plants.
This is left largely as an exercise for the reader: either use a DB
manager, or do it from the command line (see the CREATE
DATABASE syntax above).
After connecting, we prepare a query string. The prepare
method returns a statement handle ($sth), which we can then
execute() and then iterate over the results:
$sth->fetchrow_array() will keep returning an array
containing the fields we selected (*, i.e.
latin, english) until we run out of matches. When we're
done, we disconnect() the database handle. Simple.
A similar piece of code would allow us to create the
latinnames table itself, which is something of a
prerequisite for interrogation!
#!/usr/bin/perl
use strict;
use DBI;
my ( $username, $password ) = ( "USERNAME", "PASSWORD" );
my $dsn = "DBI:mysql:database=plants:host=localhost:port=3306";
my $dbh = DBI->connect( $dsn, $username, $password );
my $query_string =
"CREATE TABLE latinnames "
. "(latin VARCHAR(100) NOT NULL PRIMARY KEY, english VARCHAR(100))";
my $sth = $dbh->prepare( $query_string );
$sth->execute();
$dbh->disconnect();
As you can probably guess, all this database interfacing can get a bit tiresome, as the SQL statements are rather chunky. So to save our user the eye-strain, we can drop the low level SQL code into a module. Now we have two choices. We could just write a module/class with insert, update, delete, create, select, drop, etc., functions/methods, which is frequently useful. However, given our original idea, (several pages up now!), was to make a hash of Latin/English plant names persistent, wouldn't it be far nicer to somehow subvert how perl handles hashes, so that:
$names{'Dionaea muscipula'} = "Venus flytrap";
automagically calls the relevant insert or update methods, and
immediately changes the database contents? Sounds like it might be
complicated? Well, no. It's actually quite (not very, but
quite) simple. We need to create a tied hash,
and provide the implementation for this tying in an object oriented
module (TiePlants), and it will do exactly what we want.
With such an implementation, we can run this:
#!/usr/bin/perl
use strict;
use TiePlants;
tie my %names, 'TiePlants';
%names =
(
"Drosera capensis' => "Cape sundew",
"Dionaea muscipula" => "Venus flytrap",
);
and the database we use will have the data added. Later we can run an entirely different script:
#!/usr/bin/perl
use strict;
use TiePlants;
tie my %names, 'TiePlants';
while ( my ( $k, $v ) = each %names ) { print "$k is $v\n" }
and even though we haven't added anything to %names, it
will obligingly print out:
Dionaea muscipula is Venus flytrap Drosera capensis is Cape sundew
because the tie has automatically filled the hash with
the data from the database. Persistence is now ours! Well, not quite. We
still need to write the module that does the business. Fortunately, this
isn't very difficult. We simply need to define ten (I said it was
quite, not very, simple) methods that correspond to the
things perl does to hashes. These are TIEHASH,
FETCH, STORE, DELETE,
CLEAR, EXISTS, FIRSTKEY,
NEXTKEY, UNTIE and DESTROY, with
the capitalisation being a perl tradition that means 'this will be called
implicitly by perl', like BEGIN, AUTOLOAD and
END. First the preamble to the module:
package TiePlants; use 5.006; use strict; use warnings; use Exporter; our @ISA = qw( Exporter ); our $VERSION = '0.01';
Zzzzz…The usual suspects. Next we write the code that connects to the database when the module is loaded:
use Carp;
use DBI;
my %creds =
(
driver => "mysql",
database => "plants",
host => "localhost",
username => "Bob",
password => "CeeRiTParseWaD?",
port => "3306",
);
my $dbh = DBI->connect
(
"DBI:$creds{driver}:database=$creds{database};" .
"host=$creds{host};port=$creds{port}",
$creds{username},
$creds{password},
);
croak "Can't connect to database: $DBI::errstr\n" unless $dbh;
Carp is a module that allows you to warn and
die (that is, carp and croak) from
the perspective of the user's script, i.e. it tells the user
where the bug is in script.pl rather than where it got found
out in TiePlants.pm. This makes for happy debugging for
users, since there will obviously be no bugs in your module. The
rest is just connection code for the database, which will be run when the
module is loaded, and create a database handle to manipulate later. Note
that $DBI::errsrtr is a package variable that does for
DBI what $! does for perl. The next part of the
code to define is TIEHASH, which is called when a hash is
actually tied to the module's implementation:
sub TIEHASH
{
my ( $class, %configs ) = @_;
my $self = { LIST => {}, %configs };
croak "No table defined for tied hash\n" unless defined $self->{ table };
my $sth = $dbh->prepare( "SELECT * FROM $self->{table}" );
$sth->execute();
while ( my $result = $sth->fetchrow_hashref() )
{
$self->{LIST}{ $result->{latin} } = $result->{english};
}
return bless $self, $class;
}
This is the important bit. TIEHASH is passed all but the
first argument of tie:
tie ( %soon_to_be_tied_hash, "TiePlants", "some", "other", "arguments" );
Hence TIEHASH gets called with "TiePlants"
(the name of the class), and three other arguments. This will
croak an error, as we have decided (and will of course POD
document later) that the remaining arguments should come in key/value
pairs, allowing for some configuration, hence:
tie ( %soon_to_be_tied_hash, "TiePlants", table => "latinnames" );
is the way to call the tie in our scripts. There is no
need to capture the object returned by tie unless you want
to manipulate it directly (you probably don't). Now it will all work
nicely: the module implements the tied hash ($self) as a
hashref (as per usual for perl objects). At the top-most level, we store
the configuration options passed to TIEHASH
($self->{table} would return latinnames for
example), and we also store a reference to another anonymous hash that we
can access as $self->{LIST}. In the LIST
hashref we will store the actual tied hash data (i.e. the
key/value pairs from the database). This hashref needs populating, so we
interrogate the database for data. Interrogation consists of creating a
statement handle by prepare-ing an SQL query string from the
database handle. We then execute the query, and read data from the
statement handle (note that this has the potential to return an enormous
amount of data if your hash gets large). So remember, to execute an SQL
statement, just do three things: prepare,
execute and fetch:
my $sth = $dbh->prepare( $query_string );
$sth->execute( @placeholder_variables );
while ( my $result = $sth->fetchrow_hashref ) { blah... }
There are several ways to fetch the rows of data from the database:
rather than fetchrow_array(), here we use
fetchrow_hashref() instead. Each call to
fetchrow_hashref() returns a hashref containing key/value
pairs. In our latinname table, we have two columns, a Latin
name (called 'latin') and an English name (called 'english'). Hence each
result returned by fetchrow_hashref() will look something
like:
$result = { latin => 'Drosera aliciae', english => 'Alice sundew' };
And all we do is add key (latin) / value
(english) pairs to the hashref in
$self->{LIST}. We then bless and return this object.
That's all there is to creating a tied hash. Now we need to define the
nine methods for manipulating the hash/database.
FETCH fetches a value when passed a key ( print
$hash{'Dionaea muscipula'}; ). Since we will arrange to always
have a copy of the database data inside the tied object, we don't need to
interrogate the database to fetch values, we can simply return the value
of $self->{LIST}{ $key }. If you have a huge database, or
one that many clients will be manipulating simultaneously (you may wish
to lock the database if this is the case), then this will not be an
option, and you may need to interrogate the database on every
manipulation of the tied hash object. This is left as an exercise for the
reader. Note that in these instance methods, as usual, the tied hash
object is the first item passed to the method ($self).
sub FETCH
{
my ( $self, $key ) = @_;
return $self->{LIST}{ $key };
}
STORE stores a new pair of data in the hash.
sub STORE
{
my ( $self, $key, $value ) = @_;
$self->{LIST}{ $key } = $value;
my $sth = $dbh->prepare( "SELECT * FROM $self->{table} WHERE latin=?" );
$sth->execute( $key );
if ( $sth->fetchrow_hashref() ) # there's a record there already
{
$sth = $dbh->prepare
( "UPDATE $self->{table} SET latin=?, english=? WHERE latin=?" );
croak "Can't update new value: $DBI::errstr\n"
unless $sth->execute( $key, $value, $key );
}
else
{
$sth = $dbh->prepare
( "INSERT INTO $self->{table} (latin, english) VALUES (?, ?)" );
croak "Can't store new value: $DBI::errstr\n"
unless $sth->execute( $key, $value );
}
}
We need to update both the hash and the database when this happens, and the SQL syntax is different depending on whether we are creating a new hash item, or just updating a preexisting one. We check for this before deciding which syntax to use by seeing if there is a record with the same latin name already. We also have to use a placeholder in the SQL statement, since if we naively did this:
$sth = $dbh->prepare( "SELECT * FROM $self->{table} WHERE latin='$key' " );
then something bad would happen if $key was "Venus
flytrap", because just like Perl, SQL needs single quotes to be escaped
in quoted strings. To do this, we use a placeholder ? in our
prepare statement, and then fill them in using parameters to
execute():
$sth->execute( $placeholder_1, $placeholder_2, $placeholder_3, ... );
You can also use the $dbh->quote(); utility, but
placeholders are generally neater.
We now need a DELETE method, which will remove hash items
(delete $hash{ 'Dionaea muscipula' }; ) when required. Like
STORE, this needs to remove both the copy in the internal
hash, and the external copy in the database. To mimic perl's built in
delete exactly, it also needs to return the value of the
deleted item:
sub DELETE
{
my ( $self, $key ) = @_;
my $sth = $dbh->prepare( "DELETE FROM $self->{table} WHERE latin=?" );
$sth->execute( $key );
return delete $self->{LIST}{ $key };
}
CLEAR is the same as DELETE, but for every
item ( %hash = (); ). This is easily achieved by deleting
each item one by one with our already defined DELETE
methods:
sub CLEAR
{
my ( $self ) = @_;
$self->DELETE( $_ ) for keys %{ $self->{LIST} };
}
EXISTS determines whether an element exists or not. In
our example, we can simply ask perl whether it exists or not, since our
tied hash is essentially implemented as a hash internally anyway:
sub EXISTS
{
my ( $self, $key ) = @_;
return exists $self->{LIST}{ $key };
}
There are two iterator functions we need to define for a hash
implementation: the first tells us what the first item is, the
other tells us what the next item should be given the preceeding
one. These methods are called FIRSTKEY and
NEXTKEY respectively. Again, since we are implementing the
tied hash ultimately as a hash (with knobs on), we can just use
perl built-ins to generate these methods. If you were interrogating the
database directly (if it were enormous), you might need to implement some
sort of ordering within the database, so that items could be pulled out
in a replicable order. Most hash manipulations involve iterating over the
entire hash ( while( ( $k, $v ) = each %hash ) {
blah...} ), so you may not realise that there is actually
an internal pointer in the hash which the each function
increments on every call. To determine the first key, we need to reset
the iterator: keys does this implicitly, ensuring the next
call to each returns what perl considers as the first item
in the hash:
sub FIRSTKEY
{
my ( $self ) = @_;
my $reset_iterator = keys %{ $self->{LIST} };
my $firstkey = each %{ $self->{LIST} };
return $firstkey;
}
The NEXTKEY method gets the last key called as an
argument. Since we have the luxury of using perl built ins, we don't need
to worry about this, so we just chuck the value away, and call
each anyway. If you don't have an internal hash, you may
well need to use the last value to get the next one:
sub NEXTKEY
{
my ( $self, $lastkey ) = @_;
my $nextkey = each %{ $self->{LIST} };
return $nextkey;
}
The final two methods are called when the hash is untied
or destroyed. UNTIE is called when you call untie
%tied_hash, and can be used to clean up database connections, or
write data to an external file, and so forth. There's a subtle problem
here in that if you make use of the object returned by tie,
i.e. $tie_object = tie %tied_hash,
"TieClass"; then you will have more than one reference to the data
contained therein (one associated with the %tied_hash, one
with the $tie_object). UNTIE is called when the
hash is untied, but the actual tie object may still be
around, for example, if your user wants to mess with the insides of the
tie object somehow. UNTIE will be called with the number of
references left after untie-ing, so you can decide what to
do under the circumstances. Here, I've decided to leave the database
connection live if it seems like someone is messing with the innards; it
is ultimately closed when the object is destroyed (i.e. when
$tie_object, if it exists, has gone out of scope, and
%tied_hash has been untied.
sub UNTIE
{
my ( $self, $count ) = @_;
$dbh->disconnect() unless $count;
}
sub DESTROY
{
my ( $self ) = @_;
$dbh->disconnect();
}
And then the usual postamble:
1; __END__
Tied hashes are useful, since they are a transparent way of making
persistent data from a perl hash, which is readily manipulated by users
with familiar perl built-ins. You can make your methods do anything you
like: you can FETCH data from a command line program, or
STORE data to text files, or even pass them on to another
program that ultimately saves them somewhere entirely different. The
persistence of the data may not even be 'persistence' in this strict way:
it could be persistence in the sense of changes to environment variables,
configuration files, etc. Furthermore, you can also tie
arrays and scalars to a specific implementation (no prizes for guessing
you'll need to code TIESCALAR, POP,
SPLICE and PUSH methods), so if your data
better fits an array than a hash in its structure, there's still no
excuse for not tying the loose ends of that data up too.
For your convenience, you can download
the entire TiePlants.pm module.
