Lesson 14

We've come a long way…

…from the 'Hello, world' script. I guess by now, you should be able to do most of the following with your eyes shut:

That's really quite impressive! Just to bring you back down to Earth, we're going to start back at the very beginning all over again.

Hello, world

You may recognise this from an earlier lesson or three:

#include <stdio.h>
#include <stdlib.h>   /* declares exit() */
int main( void )
{
    printf( "Hello, world\n" );
    exit( 0 );
}

This is the benighted script in C. Why, in a Perl tutorial, should you give a damn about programming in this centuries-old glorified Assembler language ;) ? Well, there's one very good reason: perl itself is written in C, and in this lesson we will be delving a little into perl's guts, and messing with them. The old way of doing this was via the XS mechanism, whereby you wrote a module in Perl and a module written in a macro-language called XS (a sort of bastard love-child of C, Perl, English and pain). You then wrote a makefile for the modules, make-d and compiled them, and then got bored and decided to implement it in pure Perl anyway as your head hurt. No longer. We will be using the Inline modules instead, which have nearly all the power of XS, but without the grief of having to actually do anything.

To install Inline, all you need to do is:

ppm install Inline

or

perl -MCPAN -e shell
install Inline

However, as I mentioned before, there is a problem: if you are running the ActiveState port of perl under Windows, you will need cl.exe, the C/C++ compiler (and its libraries and linker) from MS Visual C++ Studio v6.0 (this is also the case if you want to compile XS extensions). You'll also need nmake. To play with Inline under WinNT, I'd strongly recommend installing the (free) Cygwin environment, with the Cygwin ports of perl, gcc (the GNU C compiler with which Cygwin perl is compiled) and make, then install Inline for this binary instead. You can then invoke your Inline-d Perl scripts from the Cygwin bash shell (make sure your shebang is correct though).

Let's look at the Inline::C version of the world's most famous program:

#!/usr/bin/perl
use strict;
use Inline 'C';
hello();
__END__
__C__
int hello()
{
    printf( "Hello, world\n" );
    exit( 0 );
}

To use Inline, you need to tell it which language you want to use, i.e. 'C', then include the program in the Perl script somehow. There are several ways to do this: we'll use this one, where we just dump the C-code after the __END__ marker in a section starting __C__.

If your C programming experience is non-existent, then the rest of this lesson may be a little confusing. You might want to check out a C tutorial first.

To execute the script, all you need to do is save it as script.pl, and run it:

script.pl
...time passes...
Hello, world

Whoohoo! There are a number of things that can go wrong, the obvious one of which is writing buggy code, but the other is due to Inline not finding a place to build the C components of the script. If you have the latter problem, try creating an environment variable called PERL_INLINE_DIRECTORY with the value c:/cygwin/tmp/inline or similar (you'll need to actually create this directory, obviously).

Now, try running the script again:

script.pl
...very little time passes...
Hello, world

You may notice the whole thing is rather quicker this time. The reason for this is that the first time you invoke an Inline-d script, Inline does all the nasty building (compiling, assembling, linking and installing) that is required to get the Perl/C interface to work, i.e. stripping out the C code from your script, transforming it into an XS module that binds your C subroutines to perl subroutines, writing a Makefile.PL, executing it, running make, testing the code and finally installing it with make install. This takes a while. However, after this, Inline will realise it has already compiled the code, and doesn't go through the rigmarole the second time: it just uses the extension it has already built.

So far so easy

Now the good stuff. Creating a script that just printfs something dull isn't very useful. What happens if we want to send data to and from the subroutine? Unfortunately, you'll need to know a little about how perl actually works to do this, because the fundamental data types of perl and C are quite different. A simple example first:

#!/usr/bin/perl
use strict;
use Inline 'C';
chomp( my $name = <STDIN> );
my $size = count( $name );
print "Your name is $size letters long\n";
__END__
__C__
#include <string.h>
int count( char *name )
{
    int length = strlen( name );
    return length;
}

This time, we grab a string from STDIN, and pass it to the C function count, which returns the length of the string. We then print this out. Now, if you're hazy on C, you need to realise the following: C has no built-in string type: strings are treated as arrays of characters terminated by a null character \0, and are manipulated through library functions. Furthermore, C is statically typed, i.e. there is no generic 'scalar' like in perl: it needs to know if what you want to store or return is a character, integer, floating point number, double precision float, long integer, etc. This is not a C tutorial, but we'll take this one a bit at a time.

The first thing we do is include the standard C string library (by #include-ing its header file string.h), which declares a function called strlen(), which returns the length of a string (not counting the trailing \0). There's actually no need to do this #include-ing, as Inline automatically #includes the perl headers (such as perl.h), which in turn pull in the common standard C headers (like stdio.h and string.h): the sharp-eyed among you may have noticed the lack of #include <stdio.h> in the hello world script.

Then we define a function called count, which takes a char* argument (we'll explain this in a minute), which it will call name, and returns an integer. All C functions look something like:

RETURN_TYPE function_name( ARG1_TYPE ARG1_NAME, ARG2_TYPE ARG2_NAME, ... )

The RETURN_TYPE can be any of the types mentioned above (int, char, etc.), or void if the function doesn't actually return anything. However, C cannot return an array, and as strings are just arrays of chars in C, it cannot return a string either. For similar reasons, it cannot easily receive a string as an argument. So how can we pass count the string whose length we want to find? The answer is to pass a pointer, which is very similar to passing a reference in Perl. You can't pass several arrays to a subroutine directly in Perl (without their being 'flattened' to a list), so you pass references to them. You can't pass an array/string directly in C, so you pass pointers to them. The pointer is (literally) a number that says where the first member of an array lives in memory. So the char* means 'a pointer (*) to an array of characters, i.e. a string'. The first (and only) argument to count() is therefore the pointer char*. The rest of the function is obvious: strlen() takes a character pointer and returns the integer length.
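To make the pointer idea concrete, here is a minimal plain-C sketch (no perl or Inline involved) of what strlen() does under the hood: it walks the pointer along the array until it meets the \0. The function name count_chars is made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: measure a string by walking a pointer along it.
   It gives the same answer as strlen(). */
size_t count_chars( const char *name )
{
    const char *p = name;          /* p starts at the first character */
    assert( name != NULL );        /* a pointer has to point somewhere */
    while ( *p != '\0' )           /* stop at the terminating null */
        p++;
    return (size_t)( p - name );   /* distance walked = length */
}
```

Calling count_chars( "Steve" ) returns 5: the five letters are counted, the trailing \0 is not.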

You can easily pass ints, longs, doubles, and char* pointers to and from C subroutines. A file called typemap (usually in the lib/ExtUtils directory) provides the glue that shows Inline how to convert between C's types and perl's types, in the above case, ensuring that the perl scalar value containing your name gets appropriately converted into a C-style pointer to an array of characters. And this is where we begin to delve into the insides of perl:

Inline and XS allow you to directly manipulate perl's own internal data structures. The most important of these is the pointer to a scalar value (SV), SV*. SV*s are pointers to C structs (a little like Perl's objects), and represent the basic internal data type that perl uses to store scalar variables like $v. An SV* points to the data you stored when you created a scalar like $v. The insides of an SV can contain a variety of other values (such as IVs, integer values, and PVs, string/pointer values), depending on whether perl thought you wanted to store an integer, a float, a string, etc. Various functions can be used to assign to, manipulate and otherwise torture SV*s, and this is what perl itself does when using $v in numerical ($x=2+$v), boolean (exit if $v) or string ($v.=" percent") context: the data stored in the scalar value is retrieved as doubles or as pointers to arrays of chars, etc: whatever is required by the interpreter. There are also AVs and HVs (no prizes for guessing what these are), themselves composed of SVs. When you passed $name to the count() function earlier, Inline implicitly converted the SV containing "Steve" or whatever into the C 'string' (char*) that the count() function wanted.
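The idea of 'one value, several representations' can be sketched in plain C. Be warned that ToySV and the toy_ functions below are invented for this illustration: perl's real SV struct is laid out quite differently, and only the general principle (slots plus flags saying which slots are valid) carries over.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* ToySV is a hypothetical, heavily simplified stand-in for perl's SV. */
enum { TOY_IOK = 1, TOY_POK = 2 };   /* 'integer OK', 'string OK' */

typedef struct {
    int    flags;   /* which of the slots below hold valid data */
    long   iv;      /* integer slot */
    char  *pv;      /* string slot (null-terminated) */
    size_t cur;     /* current string length, not counting the \0 */
} ToySV;

/* Create a ToySV holding a private copy of a C string. */
ToySV *toy_new_pv( const char *s )
{
    ToySV *sv;
    assert( s != NULL );
    sv = malloc( sizeof *sv );
    if ( sv == NULL ) return NULL;
    sv->cur = strlen( s );
    sv->pv  = malloc( sv->cur + 1 );
    if ( sv->pv == NULL ) { free( sv ); return NULL; }
    memcpy( sv->pv, s, sv->cur + 1 );   /* copy the \0 as well */
    sv->iv    = 0;
    sv->flags = TOY_POK;                /* only the string slot is valid */
    return sv;
}

/* Fetch the value as an integer: convert from the string the first
   time it is asked for, and remember the result in the integer slot. */
long toy_iv( ToySV *sv )
{
    assert( sv != NULL );
    if ( !( sv->flags & TOY_IOK ) ) {
        sv->iv = strtol( sv->pv, NULL, 10 );
        sv->flags |= TOY_IOK;
    }
    return sv->iv;
}

/* Release a ToySV and its string buffer. */
void toy_free( ToySV *sv )
{
    if ( sv != NULL ) { free( sv->pv ); free( sv ); }
}
```

A ToySV created from "42 percent" has a string length of 10, and asking for its integer value gives 42, much as using such a string in numerical context does in Perl.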

However, there's no reason why you shouldn't pass pointers to SVs and torture them as you see fit. The functions you can use to manipulate them are documented in perldoc perlapi. If you replace the C code in the previous example with that below:

int count( SV *name )
{
    int length = strlen( SvPV( name, PL_na ) );
    return length;
}

Nothing changes when you run this: it does exactly the same as the last bit of code, but you are doing the conversion explicitly: the function SvPV is the one perl (and Inline) uses to extract a pointer value (char*) from a scalar value (SV). It returns a C 'string' (char*), and takes two arguments, the first is a pointer to an SV (SV*), here name, the second is a variable into which the length of the string is put: if you don't care about this, the API (application programming interface) provides a convenience junk variable called PL_na, which we use here. In fact, that makes no sense at all, as the length of the string is exactly what we are after! A better idea would be:

int count( SV *name )
{
    STRLEN length;   /* SvPV writes the length into a STRLEN, not an int */
    char *string = SvPV( name, length );
    return (int) length;
}

An even better idea would be to use the SvCUR function, which does exactly what we want (i.e. get the length of the string stored in an SV) without pointlessly returning that char *string:

int count( SV *name )
{
    return SvCUR( name );
}

Stack hackery

There are hundreds of other functions you can use to manipulate SVs, AVs and HVs from within C, all documented in perldoc perlapi. For the next example, we'll look at how to pass and return an indefinitely long list of SVs. To do this, we'll need to become acquainted with the perl Stack, which is the thing perl uses to pass multiple arguments to and from C functions, which are inherently incapable of doing this alone. So, the Stack is the pile of SV*s that perl uses to pass and retrieve values to and from a subroutine. When you call a perl subroutine with arguments ($foo, $bar), the corresponding SV*s for $foo and $bar are pushed onto the Stack. The subroutine then pops them off the Stack as required. In the previous examples, you left it to perl and Inline to pop SV* name off the Stack, and push [the SV* corresponding to] int length onto the Stack. However, Inline provides a number of functions for manipulating the Stack directly:

#!/usr/bin/perl
use strict;
use Inline 'C';
my @numbers = qw( 1 2 3 4 5 6 7 8 9 10 );
my @pairwise_sums = sum( @numbers );
print "The pairwise sums are @pairwise_sums\n";
__END__
__C__
void sum( int num1, ... )
{
    Inline_Stack_Vars;
    int i;
    int j=0;
    int sum[Inline_Stack_Items/2 + 1];
    /* Create an array called sum half the size of the Stack
       (the +1 stops it ever being zero-length) */
    for (i=0; i+1<Inline_Stack_Items; i+=2)
    /* Iterate over the stack two at a time, ignoring an unpaired last item */
    {
        sum[j++]=SvIV(Inline_Stack_Item(i))+SvIV(Inline_Stack_Item(i+1));
        /*
        Each item on the stack is an SV*.
        We use SvIV to extract the integer value from the SV*.
        Then we sum them and dump them in the C array sum.
        */
    }
    Inline_Stack_Reset;
    for (i=0; i<j; i++)
    {
        Inline_Stack_Push(sv_2mortal(newSViv(sum[i])));
        /*
        Here we iterate over the sum array,
        creating new perl SV*s with the newSViv function.
        sv_2mortal marks each one so perl frees it when it is done with it.
        Then we push these new SV*s onto the Stack.
        */
    }
    Inline_Stack_Done;
}

(apologies to anyone who thinks my C is rubbish!). Here we have written a C function called sum that takes a list of integers and returns another list of integers that are the pairwise sums of the input list (i.e. 1+2, 3+4, 5+6, etc.). The syntax for receiving a list of arguments is:

RETURN_TYPE function_name ( dummy_type dummy_var, ... )
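For comparison, this is roughly what the same ellipsis notation looks like in plain C, outside Inline. It is only a sketch: sum_ints is a made-up name, and it uses the standard stdarg.h macros rather than the perl Stack. Note that the C function has no way of discovering how many arguments it was given, so the named argument count has to tell it.

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical example: add up `count` ints passed after the named argument. */
int sum_ints( int count, ... )
{
    va_list args;
    int i;
    int total = 0;
    assert( count >= 0 );
    va_start( args, count );           /* begin after the last named argument */
    for ( i = 0; i < count; i++ )
        total += va_arg( args, int );  /* the callee must know each type */
    va_end( args );
    return total;
}
```

Calling sum_ints( 3, 1, 2, 3 ) returns 6. The Inline version does not need a count argument, because the perl Stack records how many items it holds.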

To receive a variable size list of arguments, we use the ... ellipsis notation. XS requires at least one argument in these cases, so we provide it with a dummy variable int num1, which we never intend to use, and don't. Instead we manipulate the perl Stack directly. The first thing we need to do is initialise the Inline Stack handling functions. We do this with:

Inline_Stack_Vars;

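This should be at the top of any function manipulating the Stack, as it sets up the variables needed by the other Stack macros used in the example: Inline_Stack_Items (the number of items on the Stack), Inline_Stack_Item(i) (the SV* at position i), Inline_Stack_Reset (reset the Stack before pushing return values), Inline_Stack_Push(sv) (push an SV* onto the Stack) and Inline_Stack_Done (call this once all the return values have been pushed).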

The function sum itself works by iterating over the Stack, grabbing out a pair of SV*s with Inline_Stack_Item( i ), and using the perl API function SvIV to extract the integer value of the SV*. It then sums these and dumps the result in a C array called sum. Then we iterate over this C array, creating new SV*s using the newSViv function, which creates a perl SV* from a C int. These are then pushed onto the reset Stack, and returned. NB: note that the RETURN_TYPE of a function using Stack manipulation directly should be void, or perl will get terribly confused.
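If you want to experiment with the pairwise-sum logic without going through perl, a plain-C analogue might look like the sketch below. Here an ordinary array of longs stands in for the Stack, and the name pairwise_sums is invented for the illustration. Unlike sum() above, it overwrites the front of the array in place rather than using a temporary array, which is safe because each result is written at or before the position of the pair it was read from.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical plain-C analogue of sum(): `stack` stands in for the perl
   Stack and holds longs instead of SV*s. The pairwise sums overwrite the
   front of the array; the return value is the new number of items. */
size_t pairwise_sums( long *stack, size_t items )
{
    size_t i;
    size_t out = 0;
    assert( stack != NULL || items == 0 );
    for ( i = 0; i + 1 < items; i += 2 )   /* an unpaired last item is ignored */
        stack[out++] = stack[i] + stack[i + 1];
    return out;
}
```

Given the numbers 1 to 10, the array ends up starting 3, 7, 11, 15, 19 and the function returns 5.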

Here are some other perl API functions that may come in handy for manipulating scalar values (I'll leave arrays and hashes for your own edification):

SV *newstring = newSVpvf( "Create a new SV with %s or %s semantics", 
    "printf", "sprintf" );
sv_setpvn( newstring, "Or just overwrite one", 21 );

The sv_setpvn function can modify the string inside an SV: the three arguments are the SV* to torture (here newstring), the string to put into the SV, and the length of the string you're putting into the SV.
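The 'string plus explicit length' convention is worth a closer look, since getting the length wrong silently truncates the string. The sketch below is a plain-C analogue operating on a heap buffer rather than an SV; set_string_n is a hypothetical name and is not part of the perl API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical analogue of sv_setpvn() for a plain heap buffer:
   replace the contents of `old` (which may be NULL) with the first
   `len` bytes of `src`, and null-terminate the result. */
char *set_string_n( char *old, const char *src, size_t len )
{
    char *buf;
    assert( src != NULL );
    buf = realloc( old, len + 1 );   /* room for len bytes plus the \0 */
    if ( buf == NULL ) return NULL;  /* on failure, `old` is still valid */
    memcpy( buf, src, len );
    buf[len] = '\0';
    return buf;
}
```

Passing 21 as the length stores the whole of "Or just overwrite one", whereas passing 7 would store only "Or just". In C you can avoid counting by hand by passing strlen() of the string instead.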

The usefulness of embedding C code into Perl may not seem obvious at the moment, but it allows you to include external C libraries (such as your favourite blah library you use all the time in C) and call them directly from Perl. XS and Inline also allow you to write C extensions that may run faster than perl does for simple tasks (e.g. if you wanted to quickly sum all the A, G, T and Cs in a nucleotide string, it might be quicker to use C than to use a foreach( split //, $nucl ){ blah } Perl construct, and the overhead that entails). Finally, the Inline mechanism has even been extended so you can embed C++, Java, Python, Ruby, Awk, BASIC, Tcl and even Perl, the last using the entirely silly Acme::Inline::PERL module. There's now even more than more than one way to do it!
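As a rough idea of what the nucleotide-counting example might look like on the C side, here is a minimal sketch in plain C. The struct BaseCounts and the function count_bases are names invented for this illustration, and whether it actually beats your Perl is something you would have to benchmark for yourself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical struct and function names, for illustration only. */
typedef struct { size_t a, c, g, t, other; } BaseCounts;

/* Count the As, Cs, Gs and Ts in a null-terminated nucleotide string
   in a single pass, ignoring case. Anything else is tallied as `other`. */
BaseCounts count_bases( const char *nucl )
{
    BaseCounts n = { 0, 0, 0, 0, 0 };
    assert( nucl != NULL );
    for ( ; *nucl != '\0'; nucl++ ) {
        switch ( *nucl ) {
        case 'A': case 'a': n.a++;     break;
        case 'C': case 'c': n.c++;     break;
        case 'G': case 'g': n.g++;     break;
        case 'T': case 't': n.t++;     break;
        default:            n.other++; break;
        }
    }
    return n;
}
```

For the string "GATTACA" this counts 3 As, 1 C, 1 G and 2 Ts. To call something like it from Perl via Inline you would still need to get the counts back across the interface, for example by pushing them onto the Stack as in the sum example.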
