Lesson 5

Save it for later

Hacking on the symbol table is all well and good, but let's get back to practicalities. How do you mess about with files in Perl? Happily, messing with files and directories is dead easy. A simple example:

#!/usr/bin/perl
use strict;
use warnings;
open my $INPUT, "<", "C:/autoexec.bat"
    or die "Can't open C:/autoexec.bat for reading $!\n";
open my $OUTPUT, ">", "C:/copied.bat"
    or die "Can't open C:/copied.bat for writing $!\n";
while ( <$INPUT> )
{
    print "Writing line $_";
    print $OUTPUT "$_";
}

Here we open two files, one to read from, one to write to. The $INPUT and $OUTPUT are filehandles, just like STDIN was, only we have created these two ourselves with open. It's a good idea to give filehandles uppercase names, as these are less likely to conflict with perl keywords (we don't want to try reading from a filehandle called print for example). Note that it's also possible to write the above in the following way:

#!/usr/bin/perl
use strict;
use warnings;
open INPUT, "C:/autoexec.bat"
    or die "Can't open C:/autoexec.bat for reading $!\n";
open OUTPUT, ">C:/copied.bat"
    or die "Can't open C:/copied.bat for writing $!\n";
while ( <INPUT> )
{
    print "Writing line $_";
    print OUTPUT "$_";
}

Note three things.

  1. You can miss off the $ sigil on the filehandles. Although this will work fine, modern Perl usage is to use a lexically scoped filehandle (except for the standard input, output and error handles that are opened automatically for you). You will see the old style filehandles in code, but you should avoid them if you are running under perl versions > 5.8, as they rely on dodgy global variables.
  2. You can miss off the < on calls to open, and perl will assume you mean 'to read'. It's better practice to explicitly state what you mean with the three argument form.
  3. You can also combine the read/write bit into the filename. However, both this and missing out the < on opening to read can be the cause of subtle bugs, so you'd be better to avoid them unless you really know what you're doing. Since you're reading this, I assume you don't…

The open command needs at least two arguments: a filehandle and a string containing the name of a file to open. In the three-argument form used here, the mode ("<" for reading) sits between them. So the first line:

open my $INPUT, "<", "C:/autoexec.bat"
    or die "Can't open C:/autoexec.bat for reading $!\n";

means 'open the file C:/autoexec.bat for reading, and attach it to filehandle $INPUT'. Now, if this works, everything will be fine, the open function will return TRUE, and the stuff after or will never be executed. However, if something does go wrong (like the file doesn't exist, as it won't if you're running on Linux or MacOS), the open function will return FALSE, and the thing after the or will be executed. die causes the Perl program to terminate, with the message you give it (think of it as a suicidal print). When something goes wrong, like problems opening files, the Perl special variable $! is set with an error message, which will tell you what went wrong. So this die tells you what you couldn't do, followed by $!, which'll probably contain 'No such file or directory' or similar.
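To see $! in action without relying on die, here is a minimal sketch that deliberately tries to open a file that shouldn't exist. The file name is made up for the illustration, and the exact wording of the error text depends on your operating system.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical file name, chosen because it should not exist anywhere.
my $missing = "no_such_file_for_lesson_5.txt";

my $message;
if ( open my $INPUT, "<", $missing )
{
    # open returned TRUE: we got a usable filehandle
    $message = "Opened $missing (unexpectedly!)";
    close $INPUT;
}
else
{
    # open returned FALSE: $! holds the operating system's reason
    $message = "Can't open $missing for reading: $!";
}
print "$message\n";
```

This is the same test that 'or die' performs, just written out longhand so the program can carry on afterwards.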

A word of advice before we go any further. On Windows, paths are delimited using the \ backslash. On Unix, paths are delimited using the / forward-slash; on MacOS < X, it was the colon. Perl will happily accept either \ or / when running under Windows, but bear in mind \ is an escape character, so to write it in a double-quoted string, you'll have to escape it, thusly:

$file = "C:/autoexec.bat";
$file = "C:\\autoexec.bat";

I'd go with the first one in the name of portability and legibility, although if you ever need to call an external program from perl (using system, more later), you'll probably have to convert every / to \ with a s/\//\\/g (note the g, so that every slash is converted, not just the first).
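As a rough sketch of that conversion, here are two ways of turning forward-slashes into backslashes on a copy of a path. The path itself is invented, and using tr and alternative delimiters is just one way of avoiding a forest of leaning toothpicks.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $path = "C:/some/dir/file.txt";    # made-up path

# Copy the path, then translate every / into a \ with tr
( my $with_tr = $path ) =~ tr{/}{\\};

# The same job with s///g, using {} as delimiters so / needs no escaping
( my $with_subst = $path ) =~ s{/}{\\}g;

print "$with_tr\n";
print "$with_subst\n";
```

Both lines of output should read C:\some\dir\file.txt, and $path itself is left untouched.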

The second line:

open my $OUTPUT, ">", "C:/copied.bat"
    or die "Can't open C:/copied.bat for writing $!\n";

is very similar to the first, but here we are opening a file for writing. The difference is the >:

open my $READ, "<C:/autoexec.bat";           # explicit < for reading
open my $READ, "<", "C:/autoexec.bat";       # three argument version is safer
open my $WRITE, ">C:/autoexec.bat";          # open for writing with >
open my $WRITE, ">", "C:/autoexec.bat";      # safer
open my $APPEND, ">>C:/autoexec.bat";     # open for appending with >>
open my $APPEND, ">>", "C:/autoexec.bat"; # safer
open my $READ, "C:/autoexec.bat";               # perl will assume you 'read'

The > means open the file for writing. If you do this the file will be erased and then written to. If you don't want to wipe the file first, use >>, which opens the file for writing, but doesn't clobber the contents first. The three argument versions are generally safer. Consider whether you want this to work:

chomp( my $file_name = <STDIN> );
# user types ">important_file"
open my $FILE, $file_name;
# the writer assumed 'for reading', but the > the user typed overrides this. Oops.
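A minimal sketch of the difference, run inside a throwaway directory so nothing real gets clobbered. The file name and its contents are invented, and File::Temp is only there to provide the sandbox.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# A temporary directory that is deleted when the program exits
my $dir  = tempdir( CLEANUP => 1 );
my $file = "$dir/important_file";

open my $SETUP, ">", $file or die "Can't create $file: $!\n";
print $SETUP "precious data\n";
close $SETUP or die "Can't close $file: $!\n";

# Pretend the user typed this at the prompt
my $file_name = ">$file";

# Three arguments: the mode is fixed at "<", so the leading > is treated
# as part of the file name, and the open simply fails
my $three_arg_opened = open( my $SAFE, "<", $file_name ) ? 1 : 0;
my $size_after_three = ( -s $file ) || 0;

# Two arguments: perl reads the > out of the string and truncates the file
my $two_arg_opened = open( my $UNSAFE, $file_name ) ? 1 : 0;
close $UNSAFE if $two_arg_opened;
my $size_after_two = ( -s $file ) || 0;

print "three-arg opened: $three_arg_opened, file size now $size_after_three\n";
print "two-arg opened:   $two_arg_opened, file size now $size_after_two\n";
```

The three-argument open refuses to play along and the file survives; the two-argument open succeeds, which is exactly the problem.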

The next bit is easy:

while ( <$INPUT> )
{
    print "Writing line $_";
    print $OUTPUT "$_";
}

Remember the line reading angle brackets <> ? As in:

chomp ( $name = <STDIN> );

This is the same, but here we are reading lines from our own filehandle, $INPUT. A line is defined as stuff up to and including a newline character (just as it was when you were reading things from the keyboard). [And you also know this is strictly a fib: <> and chomp deal with lines delimited by whatever is in $/ currently.] Conveniently:

while ( <$INPUT> )

is a shorthand for:

while ( defined ( $_ = <$INPUT> ) )

i.e. while there are lines to read, read them into $_. The defined will eventually return FALSE when it gets to the end of the file (don't test for eof explicitly!), and then the while loop will terminate. However, while there really is stuff to read, perl will print "Writing line blah…" to the command line, then print the line to the $OUTPUT filehandle too using:

print $OUTPUT "$_";

Note that there is no comma between the filehandle and the thing to print. A normal print:

print "Hello\n";

is actually shorthand for:

print STDOUT "Hello\n";

where STDOUT is the standard output (i.e. the screen), like STDIN was the standard input (i.e. the keyboard). To print to a filehandle other than the default STDOUT, you need to tell print the filehandle name explicitly.
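The read loop and the comma-less print can be tried out without touching the disk at all. This is a sketch that assumes in-memory filehandles (opening a reference to a scalar instead of a file name); the "file" contents are invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# In-memory "files": opening a reference to a scalar reads from, or
# writes to, that scalar. The contents here are made up.
my $source = "first\nsecond\n0";    # last line is a bare 0, no newline
my $copy   = "";

open my $INPUT,  "<", \$source or die "Can't read in-memory input: $!\n";
open my $OUTPUT, ">", \$copy   or die "Can't write in-memory output: $!\n";

my $count = 0;
while ( <$INPUT> )    # really: while ( defined( $_ = <$INPUT> ) )
{
    $count++;
    print $OUTPUT $_;              # filehandle, then NO comma
    print STDOUT "Copied: $_";     # same idea, with STDOUT spelt out
}
print STDOUT "\n";                 # the last line had no newline of its own

close $INPUT  or die "Can't close input: $!\n";
close $OUTPUT or die "Can't close output: $!\n";

print "Copied $count lines\n";
```

The final line of the input is a lone 0, which perl would normally treat as FALSE. The loop still copies it, because the hidden defined is testing whether a line was read, not whether the line is true.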

What else can we do with filehandles? As well as opening them to read and write files, we can also open them as 'pipes' to external programs, using the | symbol, rather than > or <.

open my $PIPE_FROM_ENV, "-|", "env" or die $!;
print $_ while ( <$PIPE_FROM_ENV> ); # each line already ends in a newline

This should (as long as your operating system has a program called env) print out your environment variables. The open command:

open my $PIPE_FROM_ENV, "-|", "env" or die $!;

means 'open a filehandle called PIPE_FROM_ENV, and attach it to the output of the command env run from the command line'. You can then read lines from the output of 'env' using the <> as usual.

You can also pipe stuff into an external program like this:

open my $PIPE_TO_X, "|-", "some_program" or die $!;
print $PIPE_TO_X "Something that means something useful to some_program";

Note the or die $! : it's always important to check the return value of anything that talks to the outside world, like open, to make sure something funny isn't going on. Get into the habit early: it's surprising how often the file that can't possibly be missing actually is…
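If you have no env to hand, a pipe can be tried out against perl itself. This sketch assumes a Unix-like system, uses $^X (the path of the perl that is currently running) as the external program, and passes it an invented one-liner; giving open a list of arguments after the mode is the form that hands them to the program directly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# $^X is the perl running this script, so the command certainly exists.
# The one-liner it runs is made up for the illustration.
open my $PIPE_FROM_PERL, "-|", $^X, "-e", 'print "$_\n" for qw(alpha beta)'
    or die "Can't start child perl: $!\n";

my @from_child;
while ( <$PIPE_FROM_PERL> )
{
    chomp;
    push @from_child, $_;
}

# Closing a pipe waits for the child to finish; its exit status lands in $?
close $PIPE_FROM_PERL or die "Child perl failed, exit status $?\n";

print "Child said: @from_child\n";
```

Checking the close matters more for pipes than for plain files, since that is where you find out whether the other program fell over.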

An even more common way of executing external programs is to use system. system is useful for running external programs that do something with some data that perl has just created, or indeed for running any other external program:

system "DIR";

Will run the program DIR from the shell, should it exist. Given it doesn't exist on anything but Windows (please tell me no-one out there still has a computer running nothing but MS-DOS), there's no point in running it unless the OS is correct. Perl has the OS name (sort of) in a punctuation variable. Try running:

print $^O;
MSWin32

to find out what perl thinks your OS is called.

system is a weird command: it returns the exit status of the program it ran, and by convention that is 0 (i.e. FALSE) when the program works. Hence:

#!/usr/bin/perl
use strict;
use warnings;
if ( $^O eq "MSWin32") { system "dir" or warn "Couldn't run dir $!\n" }
else { print "Not a Windows machine.\n" }

will give spurious warnings. Here we have used warn instead of die: warn does largely the same thing as die, but doesn't actually exit: it just prints a warning. As you may guess from my 'coding' the word exit, if you want to kill a perl program happily (rather than unhappily, with die), use exit.

print "Message to STDOUT\n";
warn "Message to STDERR\n";
exit 0; # exits program gracefully with return code 0
die "Whinge to STDERR\n"; # exits program with an error message
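Back to system for a moment: it may help to look at what it actually hands back. This is a sketch that assumes a Unix-like system and runs perl itself (via $^X) as the external program, with arbitrary exit codes picked for the demonstration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run this same perl as the "external program"; the exit codes are arbitrary
my $ok_status  = system( $^X, "-e", "exit 0" );
my $bad_status = system( $^X, "-e", "exit 3" );

# system returns the raw wait status: the program's own exit code
# lives in the high byte, so shift it down to read it
my $bad_exit_code = $bad_status >> 8;

print "success returned $ok_status\n";
print "failure returned $bad_status (exit code $bad_exit_code)\n";

# An explicit comparison with 0 reads more naturally than a bare truth test
warn "The second command failed, as intended\n" if $bad_status != 0;
```

Success comes back as 0, which perl counts as FALSE, and that is where the spurious warnings come from.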

What you actually need for system is the utterly bizarre:

system "dir" and warn "Couldn't run dir $!\n";

a (historically explicable, but still bizarre) wart that Perl 6 (since renamed Raku) set out to fix. By the way, perl actually opens three filehandles when it starts up: STDIN, STDOUT and STDERR. You've met the first two already. STDERR is the filehandle warnings, dyings and other whingings are printed to: it is also connected to the screen by default, just like STDOUT, but is actually a different filehandle:

warn "bugger";

and

print STDERR "bugger";

have largely the same effect. There's no reason why you can't close and re-open a filehandle, even one of the three default ones:

#!/usr/bin/perl
use strict;
use warnings;
close STDERR;
open STDERR, ">>", "errors.log"
    or die "Can't open errors.log for appending: $!\n";
warn "You won't see this on the screen, but you'll find it in the error log";
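If you want STDERR back afterwards, one approach is to keep a duplicate of it before redirecting. The sketch below assumes the ">&" dup mode of open, writes its log into a temporary directory, and uses an invented log file name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir( CLEANUP => 1 );
my $log = "$dir/errors.log";    # hypothetical log file name

# Keep a duplicate of the real STDERR so it can be restored later
open my $SAVED_ERR, ">&", \*STDERR or die "Can't dup STDERR: $!\n";

# Re-opening STDERR closes the old one for us
open STDERR, ">>", $log or die "Can't redirect STDERR to $log: $!\n";
warn "This goes to the log\n";

# Point STDERR back at the saved copy
open STDERR, ">&", $SAVED_ERR or die "Can't restore STDERR: $!\n";

# Read the log back to prove the warning landed there
open my $LOG, "<", $log or die "Can't read $log: $!\n";
my @logged = <$LOG>;
close $LOG or die "Can't close $log: $!\n";

print "Log contains: @logged";
```

The warning never reaches the screen, but any warn issued after the restore will.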

You have now met two of Perl's logical operators, or and and. Perl has several others, including not and xor. It also has a set stolen from C that look like line-noise: ||, && and !, which also mean 'or', 'and' and 'not', but bind more tightly to their operands. Hence:

open my $FILE, "<", "C:/file.txt" or die "oops";

will work fine, because the precedence of or (and all the wordy logic operators) is very low, i.e. perl thinks this means:

open( my $FILE, "<", "C:/file.txt" ) or die "oops";

because or has an even lower precedence than the comma that separates the items of the list. However, perl thinks that:

open my $FILE, "<", "C:/file.txt" || die "oops";

means

open my $FILE, "<", ( "C:/file.txt" || die "oops" );

because || has a much higher precedence than the comma. Since "C:/file.txt" is TRUE (it's a non-empty string, and not "0"), perl will never see 'die "oops"'. The logical operators like &&, or and || return whatever they last evaluated, here C:/file.txt, so perl will try and open this file, but if it doesn't exist, there is nothing more to do and you will get no warning that something has gone wrong. The upshot: don't use || when you should use or, or make sure you put in the brackets yourself:

open( my $FILE, "<", "C:/file.txt" ) || die "oops";

Operator precedence is boring, but important. If you are worried, bung in parentheses to ensure it does what you mean. Generally perl DWIMs (particularly if you're a C programmer), but don't always count on it, especially if you're doing something complicatedly line-noisy.
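The difference is easy to demonstrate by wrapping each version in an eval block, which catches the die so that the program can report on it. This is a sketch with an invented file name that is assumed not to exist.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $missing = "no_such_file_for_lesson_5.txt";    # hypothetical name

# With ||, die is glued to the file name, which is always TRUE,
# so die never runs even though the open fails
my $double_pipe_died = 0;
eval {
    open my $FILE, "<", $missing || die "oops\n";
    1;
} or $double_pipe_died = 1;

# With or, die is tied to the return value of open itself
my $or_died = 0;
eval {
    open my $FILE, "<", $missing or die "oops\n";
    1;
} or $or_died = 1;

print "|| version died: ", ( $double_pipe_died ? "yes" : "no" ), "\n";
print "or version died: ", ( $or_died          ? "yes" : "no" ), "\n";
```

Only the or version notices that the open failed; the || version sails on in blissful ignorance.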

One last way of executing things from the shell is to use ` ` backticks. These work just like the quote operators, and will happily interpolate variables (as will system "$blah @args" for that matter), but they actually capture the output into a variable:

my $output = `ls`;
print $output;

Like qq() and q() and qw(), there is also a qx() (quote execute) operator, which is just like backticks, only you choose your own quotes:

my @output = qx:ls:;
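What you get back depends on whether you ask for a scalar or a list. This sketch assumes a Unix-like shell and, rather than relying on ls, runs perl itself through $^X with an invented one-liner.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# $^X is the perl running this script; the one-liner prints 1, 2 and 3
# on separate lines (the -l switch adds the newlines)
my $as_scalar = qx{$^X -le "print for 1..3"};    # everything in one string
my @as_list   = qx{$^X -le "print for 1..3"};    # one element per line

chomp @as_list;    # chomp works on every element of an array

print "Scalar holds ", length($as_scalar), " characters\n";
print "List holds ", scalar(@as_list), " lines: @as_list\n";
```

Assigning to an array is usually the more convenient of the two when you want to process the output line by line.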

Handling directories is as simple as handling files:

#!/usr/bin/perl
use strict;
use warnings;
opendir my $DIR, "." or die "Can't opendir .: $!\n";
while ( defined( $_ = readdir $DIR ) )
{
    print "$_\n";
}

Here's a program that changes to a new directory, and spews out stuff about the contents to a file called ls.txt in the new directory.

#!/usr/bin/perl
use strict;
use warnings;
my $dir = shift @ARGV;
chdir $dir or die "Can't change to $dir: $!";
opendir my $DIR, "."
    or die "Can't opendir $dir: $!\n"; # the new CWD, to which we changed
open my $OUTPUT, ">", "ls.txt" or die "Can't open ls.txt for writing: $!";
while ( defined ( $_ = readdir $DIR ) )
{
    if    ( -d $_ ) { print $OUTPUT "directory $_\n" }
    elsif ( -f $_ ) { print $OUTPUT "file $_\n" }
}
close $OUTPUT or die "Can't close ls.txt: $!\n";
    # pedants will want to use an 'or die' here
closedir $DIR or die "Can't closedir $dir: $!";
    # perl will close things itself, but it doesn't hurt to be explicit

There are a few new things here. @ARGV you may recognise from the symbol table programs. This is another special perl variable, like $_ and $a. It contains the arguments you passed to the program on the command line. Hence to run this program you will need to type:

perl thing.pl d:/some/directory/or/other

@ARGV will contain a list of the single value d:/some/directory/or/other, which you can get out using any array operator of your choice. In fact, pop and shift will automatically assume @ARGV in the body of the program, so you could equally well write:

my $dir = shift;

and get the same effect. This should remind you of subroutines, the only difference is that array operators default to @ARGV in the body, and @_ in a sub. The V stands for 'vector' if you're interested, it's a hangover from C.
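A small sketch of the two defaults side by side. To make it run without any command line arguments, it fills @ARGV by hand with made-up values, which is not something you would normally do in a real program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fake a command line purely for the demonstration
@ARGV = ( "d:/some/directory/or/other", "extra" );

# In the body of the program, a bare shift works on @ARGV
my $dir = shift;

# Inside a sub, a bare shift works on @_ instead
sub first_arg { return shift }
my $from_sub = first_arg( "passed to sub", "ignored" );

print "From \@ARGV:  $dir\n";
print "From \@_:     $from_sub\n";
print "Left behind: @ARGV\n";
```

After the shift, @ARGV is one element shorter, which is worth remembering if something later in the program expects to find all the arguments still there.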

The rest of the program is self explanatory, except for the -f and -d. Not too surprisingly, these are 'file test' operators. -f tests to see if a file is a plain file, and -d tests to see if a file is a directory. So:

-f "C:/autoexec.bat"

will return TRUE, as will:

-d "C:/windows"

as long as they exist! Perl has a variety of other file test operators, such as -T, which tests to see if a file is a plain text file, -B, which tests for binary-ness, and -M, which returns the age of a file in days at the time the script started. The others can be found using perldoc.
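Here is a sketch that tries a few of the file tests on things it has just created, so the answers are predictable. The file and directory names are invented, and everything lives in a temporary directory supplied by File::Temp.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir( CLEANUP => 1 );
my $file = "$dir/notes.txt";    # made-up names
my $sub  = "$dir/subdir";

open my $OUT, ">", $file or die "Can't open $file for writing: $!\n";
print $OUT "hello\n";
close $OUT or die "Can't close $file: $!\n";
mkdir $sub or die "Can't mkdir $sub: $!\n";

my %kind;
opendir my $DIR, $dir or die "Can't opendir $dir: $!\n";
while ( defined( my $entry = readdir $DIR ) )
{
    next if $entry eq "." or $entry eq "..";
    my $path = "$dir/$entry";    # readdir returns bare names, not full paths
    $kind{$entry} = -d $path ? "directory"
                  : -f $path ? "file"
                  :            "other";
}
closedir $DIR or die "Can't closedir $dir: $!\n";

printf "%-10s %s\n", $_, $kind{$_} for sort keys %kind;
printf "notes.txt is %d bytes long\n", ( -s $file );
```

Note the "$dir/$entry": the tests in the earlier program only work on bare names because it had already changed into the directory it was reading.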

RTFPD: read the perldoc

perldoc is perl's own command line manual: if you type:

perldoc -f sort

at the command prompt, perldoc will get all the documentation for the perl function sort (the -f is a switch for f(unction) documentation), and display it for you. Likewise:

perldoc -f -X

will get you information on file test operators (generically called '-X' functions in the documentation). For really general stuff:

perldoc perl

will get you general information on perl itself, and:

perldoc MODULE_NAME

e.g.:

perldoc strict

will extract internal documentation from modules (including pragma modules like strict) to tell you how to use them. This internal documentation is written in POD (plain old documentation) format, which we'll cover when we get onto writing modules. Lastly:

perldoc -h

or amusingly:

perldoc perldoc

will tell you how to use perldoc itself, which contains all the other information for its correct use I can't be bothered to write out here.

Summary

Next up, regexes, but first a quick summary. Opening files looks like:

open my $FILEHANDLE, $RW, $file_to_open; # note the commas

If $RW is "<", the file will be opened for reading; if ">", for writing; if ">>", for appending. If $RW is "-|", $file_to_open is run as an external command and opened as a pipe from it; if "|-", as a pipe to it.

You should always check the return value of open to make sure it worked, with or die $! or similar, which prints to the STDERR filehandle, as does warn. External commands can also be run with system (don't forget the counterintuitive 'and die $!'), backticks, or the qx() quotes. Read from files with the <$FILEHANDLE> angle brackets, print to them with:

print $FILEHANDLE "parp"; # note the lack of comma

and close them with close.

Use opendir, readdir, rewinddir, chdir and closedir to investigate directories (with or die as appropriate), and the file-test operators (-X) to investigate files and directories. And if in doubt, use the perldoc.

Test yourself

See if you can write a script that does the following:

#!/usr/bin/perl
use strict;
use warnings;
mkdir "environment", 0777 or die "Can't mkdir 'environment': $!\n";
open my $FILE, ">", "environment/list.txt" 
    or die "Can't open 'list.txt' for writing: $!\n";
my $env = `env`;
print $FILE $env or die "Can't print to 'list.txt': $!\n";
    # can't be too careful
close $FILE or die "Can't close file: $!\n";

Next…