Lesson 15

Tied up in knots

Sometimes you'll want to make a data structure more persistent than the runtime of a perl program, so you can write and read the same (or updated) data between successive runs of the script. Of course, there is more than one way to do this. Say you have a hash whose keys and values you would like to save persistently: maybe these are configuration options, or maybe they are a dictionary of Latin to English plant names. You could:

#!/usr/bin/perl
use strict;
my %names = ( "Drosera capensis" => "Cape sundew" );
open my $LATINNAMES, '>', shift or die $!;
while ( my ( $k, $v ) = each %names )
{
    print $LATINNAMES
      "<item>\n\t<latin>$k</latin>\t<english>$v</english>\n</item>";
}
#!/usr/bin/perl
use strict;
my %names = ( "Drosera capensis" => "Cape sundew" );
open my $LATINNAMES, '>', shift or die $!;
print $LATINNAMES Dumper( \%names );

However, both of these techniques suffer from the drawback that your script has to deal with loading and parsing the file at startup, and with re-dumping the relevant data at shutdown. You have to manually couple the data and the external file, which may be a bit of a drama. Furthermore, the external file (and I will bet here and now you won't write it in the probably soon to be lingua franca of everywhere, i.e. valid XML), will be your own proprietary format, that no other script will be able to use without copying the chunks of your script that deal with parsing and dumping. So you (sensibly) put the parse and dump routines in a module, so that more than one script can access the same data. If you're paranoid, you make sure you also lock these external files to prevent anyone else messing with them while you're using them:

#!/usr/bin/perl
use strict;
use Fcntl qw( :flock :seek );
open  my $LATINNAMES, '>', "C:/dump.txt";
flock    $LATINNAMES, LOCK_EX;     # exclusive lock
seek     $LATINNAMES, 0, SEEK_END; # seek to the end of the file
print    $LATINNAMES "Drosera capensis :: Cape sundew\n";
flock    $LATINNAMES, LOCK_UN;     # unlock

The functions flock and seek are a tad complex; flock takes two arguments, a filehandle to (un)lock, and a number (1, 2, 4 or 8) that describes how you want to lock the file (exclusive lock, shared lock, unlock, etc.). Since remembering that 8 = unlock is a pain, the Fcntl module will export four constant subroutines via the export tag :flock that you can use instead: LOCK_UN is just a funny looking subroutine call to something like:

sub LOCK_UN { return 8 }

seek is even more horrible, taking three arguments. It moves the 'read/write head' in the filehandle to a particular position (here, so that if someone has buggered about with the file between opening and locking it, we can move the head back to where we want it). The first argument to seek is a filehandle, the second is the number of bytes from a particular offset in the file to move to, and the third is that offset. Here the offset is SEEK_END (numerically '2'), which is the end of the file, and we seek 0 bytes on from it. Hence seek( FILE, 0, SEEK_END ) moves the read/write head right to the end of the file. Bit convoluted, but there you have it.

Of course, when you come to write your next quick hack, which you get to dump data in an entirely different format, you have to write another module, and so on, till you are crushed under the weight of your ParseDump/*.pm modules, file locking utilities and a great clumping mass of half-baked XML, plain text config file files and frozen, Storable-d and Data::Dumper-ed variables. Feeling ill yet? There is an easier way. How about dumping your data in a format that everyone can interrogate in a standard way, from anywhere in the world with a well known language? Enter the world of the relational database:

Relational databases are a way of storing data, which may have complex relationships to other pieces of data. For example, we may want to store the details (customer name, invoice address) of all our customers in one 'table' (like a spreadsheet). We may also want to store a record (date, customer name, service) of every transaction we make with our many customers in another table. By using a language called SQL (simple query language), we can interrogate the database to get the names and addresses of all customers who have had a transaction with us in the last six months.

Relational databases (like XML) are used everywhere, so sooner or later you will end up storing or accessing data via one of them. They're also a very good place to store your own persistent data. Fortunately, there is a freeware (at least for the little people) database called MySQL (there are others too), that'll run on Unix or Windows. To play with it, you'll need to install it, and also install two perl modules, DBI and DBD::mysql. The Database Interface (DBI) module provides various standard methods (an API) that allow you to interact with databases, and the Database Driver (DBD) module provides the glue that sticks the interface to a particular database: there's also a DBD::Oracle, DBD::msql, etc. These modules allow you to play with SQL from within Perl scripts using an abstracted, common API i.e. you can use any database, Oracle, PostGreSQL, etc., with no change in syntax, at least in the Perl parts, so long as you tell perl which DBD driver module to use. This is not the place to learn SQL and database syntax. However, the basic concepts are:

There's plenty more syntax you can learn elsewhere, but this will be enought to create something useful. To use SQL from within perl, we need to connect to a database, prepare and execute an SQL statement, read the result(s), and (eventually) disconnect from the database. In DBI, the general scheme is therefore:

#!/usr/bin/perl
use strict;
use DBI;
my ( $username, $password ) = ( "USERNAME", "PASSWORD" );
my $dsn = "DBI:mysql:database=plants:host=localhost:port=3306";
my $dbh = DBI->connect( $dsn, $username, $password );
my $query_string = "SELECT * FROM latinnames WHERE latin='Nepenthes rajah'";
my $sth = $dbh->prepare( $query_string );
$sth->execute();
while ( my @results = $sth->fetchrow_array() )
{
    print "$_\n" for @results;
}
$dbh->disconnect();

To interact with a database, we need to use DBI; The first DBI method we need to deal with is the one that allows you to connect to a database server. Helpfully enough this is a (class) method called connect(), which serves the purpose of new() in most OO schemes:

my $dbh = DBI->connect( $dsn, $username, $password);

connect returns a database handle (conventionally called $dbh). The data source name (often written $dsn) is a colon padded string that tells perl which database to connect to and how, and it should look like:

DBI : dbd_driver_name : database=database_name : 
    host = host_name : port = port_number

For example:

DBI:mysql:database=plants:host=localhost:port=3306

This allows us to connect to the MySQL database plants, running on the local machine on port 3306. Of course, it won't connect unless you've actually created a database called plants. This is left largely as an exercise for the reader: either use a DB manager, or do it from the command line (see the CREATE DATABASE syntax above).

After connecting, we prepare a query string. The prepare method returns a statement handle ($sth), which we can then execute() and then iterate over the results: $sth->fetchrow_array() will keep returning an array containing the fields we selected (*, i.e. latin, english) until we run out of matches. When we're done, we disconnect() the database handle. Simple.

A similar piece of code would allow us to create the latinnames table itself, which is something of a prerequisite for interrogation!

#!/usr/bin/perl
use strict;
use DBI;
my ( $username, $password ) = ( "USERNAME", "PASSWORD" );
my $dsn = "DBI:mysql:database=plants:host=localhost:port=3306";
my $dbh = DBI->connect( $dsn, $username, $password );
my $query_string = 
      "CREATE TABLE latinnames "
    . "(latin VARCHAR(100) NOT NULL PRIMARY KEY, english VARCHAR(100))";
my $sth = $dbh->prepare( $query_string );
$sth->execute();
$dbh->disconnect();

As you can probably guess, all this database interfacing can get a bit tiresome, as the SQL statements are rather chunky. So to save our user the eye-strain, we can drop the low level SQL code into a module. Now we have two choices. We could just write a module/class with insert, update, delete, create, select, drop, etc., functions/methods, which is frequently useful. However, given our original idea, (several pages up now!), was to make a hash of Latin/English plant names persistent, wouldn't it be far nicer to somehow subvert how perl handles hashes, so that:

$names{'Dionaea muscipula'} = "Venus flytrap";

automagically calls the relevant insert or update methods, and immediately changes the database contents? Sounds like it might be complicated? Well, no. It's actually quite (not very, but quite) simple. We need to create a tied hash, and provide the implementation for this tying in an object oriented module (TiePlants), and it will do exactly what we want. With such an implementation, we can run this:

#!/usr/bin/perl
use strict;
use TiePlants;
tie my %names, 'TiePlants';
%names =
(
    "Drosera capensis' => "Cape sundew",
    "Dionaea muscipula" => "Venus flytrap",
);

and the database we use will have the data added. Later we can run an entirely different script:

#!/usr/bin/perl
use strict;
use TiePlants;
tie my %names, 'TiePlants';
while ( my ( $k, $v ) = each %names ) { print "$k is $v\n" }

and even though we haven't added anything to %names, it will obligingly print out:

Dionaea muscipula is Venus flytrap
Drosera capensis is Cape sundew

because the tie has automatically filled the hash with the data from the database. Persistence is now ours! Well, not quite. We still need to write the module that does the business. Fortunately, this isn't very difficult. We simply need to define ten (I said it was quite, not very, simple) methods that correspond to the things perl does to hashes. These are TIEHASH, FETCH, STORE, DELETE, CLEAR, EXISTS, FIRSTKEY, NEXTKEY, UNTIE and DESTROY, with the capitalisation being a perl tradition that means 'this will be called implicitly by perl', like BEGIN, AUTOLOAD and END. First the preamble to the module:

package TiePlants;
use 5.006;
use strict;
use warnings;
use Exporter;
our @ISA = qw( Exporter );
our $VERSION = '0.01';

Zzzzz…The usual suspects. Next we write the code that connects to the database when the module is loaded:

use Carp;
use DBI;
my %creds =
(
    driver   => "mysql",
    database => "plants",
    host     => "localhost",
    username => "Bob",
    password => "CeeRiTParseWaD?",
    port     => "3306",
);
my $dbh = DBI->connect
(
    "DBI:$creds{driver}:database=$creds{database};" .
      "host=$creds{host};port=$creds{port}",
    $creds{username},
    $creds{password},
);
croak "Can't connect to database: $DBI::errstr\n" unless $dbh;

Carp is a module that allows you to warn and die (that is, carp and croak) from the perspective of the user's script, i.e. it tells the user where the bug is in script.pl rather than where it got found out in TiePlants.pm. This makes for happy debugging for users, since there will obviously be no bugs in your module. The rest is just connection code for the database, which will be run when the module is loaded, and create a database handle to manipulate later. Note that $DBI::errsrtr is a package variable that does for DBI what $! does for perl. The next part of the code to define is TIEHASH, which is called when a hash is actually tied to the module's implementation:

sub TIEHASH
{
    my ( $class, %configs ) = @_;
    my $self = { LIST => {}, %configs };
    croak "No table defined for tied hash\n" unless  defined $self->{ table };
    my $sth = $dbh->prepare( "SELECT * FROM $self->{table}" );
    $sth->execute();
    while ( my $result = $sth->fetchrow_hashref() )
    {
        $self->{LIST}{ $result->{latin} } = $result->{english};
    }
    return bless $self, $class;
}

This is the important bit. TIEHASH is passed all but the first argument of tie:

tie ( %soon_to_be_tied_hash, "TiePlants", "some", "other", "arguments" );

Hence TIEHASH gets called with "TiePlants" (the name of the class), and three other arguments. This will croak an error, as we have decided (and will of course POD document later) that the remaining arguments should come in key/value pairs, allowing for some configuration, hence:

tie ( %soon_to_be_tied_hash, "TiePlants", table => "latinnames" );

is the way to call the tie in our scripts. There is no need to capture the object returned by tie unless you want to manipulate it directly (you probably don't). Now it will all work nicely: the module implements the tied hash ($self) as a hashref (as per usual for perl objects). At the top-most level, we store the configuration options passed to TIEHASH ($self->{table} would return latinnames for example), and we also store a reference to another anonymous hash that we can access as $self->{LIST}. In the LIST hashref we will store the actual tied hash data (i.e. the key/value pairs from the database). This hashref needs populating, so we interrogate the database for data. Interrogation consists of creating a statement handle by prepare-ing an SQL query string from the database handle. We then execute the query, and read data from the statement handle (note that this has the potential to return an enormous amount of data if your hash gets large). So remember, to execute an SQL statement, just do three things: prepare, execute and fetch:

my $sth = $dbh->prepare( $query_string );
$sth->execute( @placeholder_variables );
while ( my $result = $sth->fetchrow_hashref ) { blah... }

There are several ways to fetch the rows of data from the database: rather than fetchrow_array(), here we use fetchrow_hashref() instead. Each call to fetchrow_hashref() returns a hashref containing key/value pairs. In our latinname table, we have two columns, a Latin name (called 'latin') and an English name (called 'english'). Hence each result returned by fetchrow_hashref() will look something like:

$result = { latin => 'Drosera aliciae', english => 'Alice sundew' };

And all we do is add key (latin) / value (english) pairs to the hashref in $self->{LIST}. We then bless and return this object. That's all there is to creating a tied hash. Now we need to define the nine methods for manipulating the hash/database.

FETCH fetches a value when passed a key ( print $hash{'Dionaea muscipula'}; ). Since we will arrange to always have a copy of the database data inside the tied object, we don't need to interrogate the database to fetch values, we can simply return the value of $self->{LIST}{ $key }. If you have a huge database, or one that many clients will be manipulating simultaneously (you may wish to lock the database if this is the case), then this will not be an option, and you may need to interrogate the database on every manipulation of the tied hash object. This is left as an exercise for the reader. Note that in these instance methods, as usual, the tied hash object is the first item passed to the method ($self).

sub FETCH
{
    my ( $self, $key ) = @_;
    return $self->{LIST}{ $key };
}

STORE stores a new pair of data in the hash.

sub STORE
{
    my ( $self, $key, $value ) = @_;
    $self->{LIST}{ $key } = $value;
    my $sth = $dbh->prepare( "SELECT * FROM  $self->{table} WHERE latin=?" );
    $sth->execute( $key );
    if ( $sth->fetchrow_hashref() ) # there's a record there already
    {
        $sth = $dbh->prepare
            ( "UPDATE $self->{table} SET latin=?, english=? WHERE latin=?" );
        croak "Can't update new value: $DBI::errstr\n" 
            unless $sth->execute( $key, $value, $key );
    }
    else
    {
        $sth = $dbh->prepare
            ( "INSERT INTO $self->{table} (latin, english) VALUES (?, ?)" );
        croak "Can't store new value: $DBI::errstr\n" 
            unless $sth->execute( $key, $value );
    }
}

We need to update both the hash and the database when this happens, and the SQL syntax is different depending on whether we are creating a new hash item, or just updating a preexisting one. We check for this before deciding which syntax to use by seeing if there is a record with the same latin name already. We also have to use a placeholder in the SQL statement, since if we naively did this:

$sth = $dbh->prepare( "SELECT * FROM $self->{table} WHERE latin='$key' " );

then something bad would happen if $key was "Venus flytrap", because just like Perl, SQL needs single quotes to be escaped in quoted strings. To do this, we use a placeholder ? in our prepare statement, and then fill them in using parameters to execute():

   $sth->execute( $placeholder_1, $placeholder_2, $placeholder_3, ... );

You can also use the $dbh->quote(); utility, but placeholders are generally neater.

We now need a DELETE method, which will remove hash items (delete $hash{ 'Dionaea muscipula' }; ) when required. Like STORE, this needs to remove both the copy in the internal hash, and the external copy in the database. To mimic perl's built in delete exactly, it also needs to return the value of the deleted item:

sub DELETE
{
    my ( $self, $key ) = @_;
    my $sth = $dbh->prepare( "DELETE FROM $self->{table} WHERE latin=?" );
    $sth->execute( $key );
    return delete $self->{LIST}{ $key };
}

CLEAR is the same as DELETE, but for every item ( %hash = (); ). This is easily achieved by deleting each item one by one with our already defined DELETE methods:

sub CLEAR
{
    my ( $self ) = @_;
    $self->DELETE( $_ ) for keys %{ $self->{LIST} };
}

EXISTS determines whether an element exists or not. In our example, we can simply ask perl whether it exists or not, since our tied hash is essentially implemented as a hash internally anyway:

sub EXISTS
{
    my ( $self, $key ) = @_;
    return exists $self->{LIST}{ $key };
}

There are two iterator functions we need to define for a hash implementation: the first tells us what the first item is, the other tells us what the next item should be given the preceeding one. These methods are called FIRSTKEY and NEXTKEY respectively. Again, since we are implementing the tied hash ultimately as a hash (with knobs on), we can just use perl built-ins to generate these methods. If you were interrogating the database directly (if it were enormous), you might need to implement some sort of ordering within the database, so that items could be pulled out in a replicable order. Most hash manipulations involve iterating over the entire hash ( while( ( $k, $v ) = each %hash ) { blah...} ), so you may not realise that there is actually an internal pointer in the hash which the each function increments on every call. To determine the first key, we need to reset the iterator: keys does this implicitly, ensuring the next call to each returns what perl considers as the first item in the hash:

sub FIRSTKEY
{
    my ( $self )       = @_;
    my $reset_iterator = keys %{ $self->{LIST} };
    my $firstkey       = each %{ $self->{LIST} };
    return $firstkey;
}

The NEXTKEY method gets the last key called as an argument. Since we have the luxury of using perl built ins, we don't need to worry about this, so we just chuck the value away, and call each anyway. If you don't have an internal hash, you may well need to use the last value to get the next one:

sub NEXTKEY
{
    my ( $self, $lastkey ) = @_;
    my $nextkey = each %{ $self->{LIST} };
    return $nextkey;
}

The final two methods are called when the hash is untied or destroyed. UNTIE is called when you call untie %tied_hash, and can be used to clean up database connections, or write data to an external file, and so forth. There's a subtle problem here in that if you make use of the object returned by tie, i.e. $tie_object = tie %tied_hash, "TieClass"; then you will have more than one reference to the data contained therein (one associated with the %tied_hash, one with the $tie_object). UNTIE is called when the hash is untied, but the actual tie object may still be around, for example, if your user wants to mess with the insides of the tie object somehow. UNTIE will be called with the number of references left after untie-ing, so you can decide what to do under the circumstances. Here, I've decided to leave the database connection live if it seems like someone is messing with the innards; it is ultimately closed when the object is destroyed (i.e. when $tie_object, if it exists, has gone out of scope, and %tied_hash has been untied.

sub UNTIE
{
    my ( $self, $count ) = @_;
    $dbh->disconnect() unless $count;
}
sub DESTROY
{
    my ( $self ) = @_;
    $dbh->disconnect();
}

And then the usual postamble:

1;
__END__

Tied hashes are useful, since they are a transparent way of making persistent data from a perl hash, which is readily manipulated by users with familiar perl built-ins. You can make your methods do anything you like: you can FETCH data from a command line program, or STORE data to text files, or even pass them on to another program that ultimately saves them somewhere entirely different. The persistence of the data may not even be 'persistence' in this strict way: it could be persistence in the sense of changes to environment variables, configuration files, etc. Furthermore, you can also tie arrays and scalars to a specific implementation (no prizes for guessing you'll need to code TIESCALAR, POP, SPLICE and PUSH methods), so if your data better fits an array than a hash in its structure, there's still no excuse for not tying the loose ends of that data up too.

For your convenience, you can download the entire TiePlants.pm module.

Next…