Lesson 10

Guts and gore

Perl 5, as the quite venerable version of Perl you've been programming in is called, is eventually going to be usurped. For Perl 6 is on its (rather slow) way, and it looks to be very lovely. Before we get onto the changes in the grammar of Perl 6 (capital P for Perl the language), it's worth a quick look under the hood of perl (little p for perl the interpreter), to see what actually happens when you type:

print "Hello world.\n";

Perl, like the other so-called scripting languages such as Python, Ruby and PHP (and even Java for that matter), runs on what is termed a virtual machine (VM) or interpreter. Programming languages running on a VM are therefore called interpreted languages, and are slightly distinct from properly compiled languages, such as C and C++. Now, if you've ever programmed in C, or have run a Windows application, you may know what a compiled program is. Or in the latter case, maybe not. Let's start at the bottom.

The only thing your computer's CPU understands is how to shift about 0s and 1s from one place to another. The only way to get the computer to do this is to feed it machine code. Machine code consists of binary (001010101011101010…) instructions that the CPU of your computer can understand. One level above machine code is assembly code, which is written in assembly languages. Assembly language is a thin wrapper over the rather indecipherable machine code, and it is readily boiled down to machine code by a program called an assembler. Now, programming in assembly language is the most basic, and fastest way of getting things done (for the machine at least), but not exactly the easiest. Here is the hello world program for the x86 CPU (running under DOS):

title Hello World Program (hello.asm)
; This program displays "Hello, World!"
dosseg
.model small
.stack 100h
.data
hello_message db 'Hello, World!',0dh,0ah,'$'
.code
main proc
mov ax,@data
mov ds,ax
mov ah,9
mov dx,offset hello_message
int 21h
mov ax,4C00h
int 21h
main endp
end main

One level above assembly languages are compiled languages like C or C++. These languages allow you to write code that addresses higher level actions, like reading and writing to files, without having to worry about exactly what this means in terms of pushing 0s and 1s about in the memory, CPU and bus (although you're still programming at a level where you can readily see this if you want). Compiled languages are compiled by (surprise!) a program called a compiler, into assembly code. Say you have this C program:

#include <stdio.h>
int main ( )
{
    printf("Hello, world.\n");
    return 0;
}

This does largely the same thing as the following Perl program (which I've mangled to make it look as similar as possible):

#No need to 'use' STDIN and STDOUT functions, as perl already has them inbuilt
main ();
sub main
{
    print("Hello, world.\n");
    return( 0 );
}

See the Hello, world program in n different languages.

To turn the C code into an executable program, you run it through several stages, largely consisting of turning C into assembly code and thence to machine code, with a few extras thrown in for good measure.

The first thing that happens is preprocessing, where directives such as #include <stdio.h> (which is vaguely similar to use MODULE; in Perl) are processed. The result of this is then handed to the compiler, which boils down (lexes and parses) the C code via a parse tree into the assembly language for the particular computer you are using. The assembly language is then assembled by the assembler into object code. Object code is pretty much just a blob of machine code, but a single executable program might need more than one blob of object code to work, if you have included any standard library functions (just like you need a few modules to run the average Perl script). Finally, several object files (possibly including some derived from standard libraries, such as stdio) can be linked together by a linker to form a big blob of real, executable machine code. You can store these compiled, assembled, linked lumps of machine code, and give them the grand title of programs, and indeed this is just what most of the programs you use on your computer are.

C-code → PREPROCESSOR
  (which adds in definitions of library functions, 
      so the compiler knows what the program has to do) →
Processed C-code → COMPILER
  (which boils down the fairly machine-independent C-code to a parse tree 
      and thence to a machine-specific format) →
Assembly code → ASSEMBLER
  (which removes the human friendly wrapping from the assembly code) →
Object code → LINKER
  (which will grab other bits of object code from libraries 
      indicated by the preprocessor) →
Machine code → CPU
  (which executes the code; you can also save the machine code 
      as a binary executable program)

The main problem with compiled languages is that they have to be compiled down to machine code. Although compiled code runs very fast once compiled to machine code, machine code is highly specific to a particular computer architecture (mostly a matter of which CPU and OS you have). Hence you have to compile such a program again whenever you want it to work on a different architecture. You'll likely as not also have to build some architecture-specific functionality into your code to get your program to work on more than one architecture. So although compiled C programs are nippy, the compiled binaries are also completely unportable.

What interpreted languages do is put a 'skin' (the interpreter, or virtual machine (VM)), over the top of a computer's hardware, and you then write programs that run on this 'skin', rather than on the hardware itself. The interpreter hides the internal detail of how print "hello, world" is actually achieved on your computer from you. Your programs still pass through similar stages to a compiled program, but rather than being compiled to your computer's CPU, they are compiled to a software simulation of a CPU, the interpreter's virtual machine.

Now, the interpreter (in particular, we're talking about perl's interpreter here) is just a program written in C, that has to be compiled itself when you install it. Hence the writers of perl itself have to worry about exactly how print is made to work on different architectures. However, once it's there, you the writer don't have to worry: a script written on a Linux box will also run on a Windows XP laptop, and the same program will also run happily on an iMac under MacOS 8, despite the fact they have utterly different architectures. perl has been described as a thin skin over the top of any computer that makes everything look like UNIX.
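If you want to see the 'skin' at work, here is a minimal sketch. It assumes only the core File::Spec module; the file names data and results.txt are made up for illustration. The same script should run unchanged on any of the machines mentioned above, and only the output differs:

```perl
use strict;
use warnings;
use File::Spec;

# $^O holds the name of the operating system this perl was built for,
# e.g. "linux", "MSWin32" or "darwin".
print "Running on: $^O\n";

# File::Spec joins path components using the local conventions, so the
# script never has to hard-code "/" or "\". The names are hypothetical.
my $path = File::Spec->catfile('data', 'results.txt');
print "Local path: $path\n";
```

The point is that the script asks perl for the local details rather than knowing them itself: the people who ported perl to your platform have already done the worrying.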

Now there are a few problems with this. One is that some architectures (like Windows running over Intel) don't allow certain things to work properly, like fork()ing, although there has been progress by implementing fork using threads. So, although it is largely true that a Perl program will run on any perl interpreter anywhere, regardless of the actual computer underneath the hood, there are certain exceptions, most importantly on Windows and on versions of Mac OS before X. Perl programs are therefore largely 'portable'.
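Since fork is the classic example of something that may not behave identically everywhere, it pays to code defensively around it. The following is a sketch only: run_child is a hypothetical helper name, and it assumes a perl where fork is either native or emulated. fork returns undef when it cannot create a child, so always check for that before carrying on:

```perl
use strict;
use warnings;

# Flush output straight away, so a forked child never inherits
# (and later re-prints) text still sitting in the parent's buffer.
$| = 1;

# Run a child process that exits with the given status, and return the
# status the parent observed. run_child is a hypothetical helper.
sub run_child {
    my ($status) = @_;
    my $pid = fork();
    die "fork failed: $!\n" unless defined $pid;   # undef means no child was created
    if ($pid == 0) {
        exit $status;       # we are the child: finish immediately
    }
    waitpid($pid, 0);       # we are the parent: wait for the child
    return $? >> 8;         # the exit status lives in the high byte of $?
}

print "child exited with status ", run_child(3), "\n";
```

On a UNIX-like system this should report a status of 3. Where fork is emulated, the 'child' is not a real process, so anything that depends on genuinely separate processes deserves testing on that platform.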

The other problem with interpreted languages is they can run quite slowly: the program has to be compiled onto the virtual machine then run every time you invoke it. In contrast, compiled programs only need to be compiled to the real machine once. Java (which runs on the Java VM) gets round this by storing the compiled program as bytecode, which is effectively machine code for the Java virtual machine. However, as well as the compilation phase, there is also some overhead in actually starting up the perl virtual machine in the first place (a problem which Java doesn't g…e…t…ar……o……u………nd in my experience).

However, Perl, Python, Java and other interpreted languages are still popular, since they have many high level functions that would be a serious pain in the arse to program in C or C++ from first principles (I dare you to implement a sorting algorithm in C, or even in Perl without using sort or any array functions. It's not that difficult, but it's not fun either). They also take care of their own memory management: in many compiled languages (e.g. C), you have to worry about ensuring you have enough space to create variables, and if not, you have to ask the operating system for more.
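To take up my own dare, here is one way of sorting by hand, set against the built-in. It is a plain insertion sort (my choice of algorithm, nothing special), written using only indexing and loops, and insertion_sort is just a name I've made up:

```perl
use strict;
use warnings;

# Insertion sort for a list of numbers, using only array indexing and
# loops (no sort, push, splice and friends). Returns a new sorted list.
sub insertion_sort {
    my @items = @_;
    for (my $i = 1; $i <= $#items; $i++) {
        my $key = $items[$i];
        my $j   = $i - 1;
        # Shuffle larger elements one place right to make room for $key.
        while ($j >= 0 && $items[$j] > $key) {
            $items[$j + 1] = $items[$j];
            $j--;
        }
        $items[$j + 1] = $key;
    }
    return @items;
}

my @by_hand  = insertion_sort(5, 3, 9, 1, 4);
my @built_in = sort { $a <=> $b } (5, 3, 9, 1, 4);

print "@by_hand\n";     # prints "1 3 4 5 9"
print "@built_in\n";    # prints "1 3 4 5 9"
```

Fifteen-odd lines against one, and the one-liner will be faster too. Note also that nowhere did we have to ask for memory for @items: perl grew the array for us.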

OK, now into the nitty gritty ('what, wasn't that just the nitty gritty?'). Perl programs actually go through similar stages to a compiled program when they are run on the perl interpreter. Remember an interpreter is a virtual machine: you can think of perl as a simulation of a computer running on your computer if you like (which would please Alan Turing immensely).
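You can watch the compile-then-run split for yourself with a BEGIN block, which perl executes as soon as it has finished parsing it, before the rest of the program has started running. A small sketch (the @order array is only there to record what happened when):

```perl
use strict;
use warnings;

our @order;

# This line comes first in the file, but it is ordinary run-time code...
push @order, 'run phase';

# ...whereas a BEGIN block is executed during compilation, the moment
# the parser has finished reading it.
BEGIN { push @order, 'compile phase' }

print join(', ', @order), "\n";    # prints "compile phase, run phase"
```

Relatedly, running perl -c on a script compiles it and reports on the syntax without running it, which is a handy way of seeing the first half of the process on its own.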

The first thing that happens to a Perl program when you invoke it is that perl's interpreter lexes and parses your program into an internal format called a parse tree, just as the C compiler did. For example:

(2+2) *5

will be lexed (broken into recognisable 'tokens') into something like:

( 2 + 2 ) * 5

The parser will then construct a parse tree that looks something like:

        /---- 5
  --*--|       /---- 2
        \--+--| 
               \---- 2

Which I hope makes sense: it tells perl it'll need to add (+) 2 and 2 together, then take that result, and multiply it by the number 5, then take that result and return it. The syntax tree is then compiled (to the VM, not to the real machine), to form bytecode, which as I've said, is basically assembly code for the virtual machine. An optimiser then tarts the bytecode up, before sending it to the interpreter proper, which actually executes the code. That's why it's called an interpreter: it interprets virtual machine code into real executable code that the computer can understand. [NB. This is strictly a lie. The perl interpreter actually executes perl opcodes (which are C structs) not bytecode, but the difference between bytecode and opcodes needn't worry you too much ☺ ].
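To make lexing and parsing a bit more concrete, here is a toy version in Perl itself. This is emphatically not how perl's own parser works inside; it is a small recursive-descent sketch of my own, and the names lex, parse_expr, parse_term, parse_factor and evaluate are all made up. It handles only whole numbers, brackets and + - * /, and the lexer quietly ignores anything else:

```perl
use strict;
use warnings;

# Lexer: break the source text into tokens (numbers, operators, brackets).
sub lex {
    my ($src) = @_;
    my @tokens = $src =~ /\d+|[-+*\/()]/g;
    return @tokens;
}

# Parser, following this little grammar:
#   expr   := term   (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | '(' expr ')'
# A tree node is either a plain number or [operator, left, right].
sub parse_expr {
    my ($tokens) = @_;
    my $node = parse_term($tokens);
    while (@$tokens && $tokens->[0] =~ /^[-+]$/) {
        my $op    = shift @$tokens;
        my $right = parse_term($tokens);
        $node = [$op, $node, $right];
    }
    return $node;
}

sub parse_term {
    my ($tokens) = @_;
    my $node = parse_factor($tokens);
    while (@$tokens && $tokens->[0] =~ /^[*\/]$/) {
        my $op    = shift @$tokens;
        my $right = parse_factor($tokens);
        $node = [$op, $node, $right];
    }
    return $node;
}

sub parse_factor {
    my ($tokens) = @_;
    my $tok = shift @$tokens;
    die "unexpected end of input\n" unless defined $tok;
    if ($tok eq '(') {
        my $node  = parse_expr($tokens);
        my $close = shift @$tokens;
        die "expected ')'\n" unless defined $close && $close eq ')';
        return $node;
    }
    die "unexpected token '$tok'\n" unless $tok =~ /^\d+$/;
    return $tok;
}

# "Interpreter": walk the tree from the leaves up, doing the sums.
sub evaluate {
    my ($node) = @_;
    return $node unless ref $node;            # a leaf is just a number
    my ($op, $left, $right) = @$node;
    my ($x, $y) = (evaluate($left), evaluate($right));
    return $op eq '+' ? $x + $y
         : $op eq '-' ? $x - $y
         : $op eq '*' ? $x * $y
         :              $x / $y;
}

my @tokens = lex('(2+2) *5');
print "@tokens\n";                  # prints "( 2 + 2 ) * 5"
my $tree = parse_expr(\@tokens);    # builds ['*', ['+', 2, 2], 5]
print evaluate($tree), "\n";        # prints "20"
```

The tree it builds has the same shape as the diagram above: a * node whose children are the number 5 and a + node, which in turn holds the two 2s.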

The process largely looks like this:

Perl code → PARSER
  (which tokenises and parses your code into an internal 
    format called a parse tree) →
Parse tree → COMPILER
  (which converts the parse tree into opcodes 
    for the perl VM) →
Opcodes → OPTIMISER
  (which makes the perl opcodes run faster) →
Optimised Opcodes → INTERPRETER
  (which carries out each opcode using machine code 
    instructions for your hardware) →
Machine code → CPU
  (which executes the code)

Various weird things can be done to upset this nice linear flow. perl's innards are quite incestuous, so eval allows the interpreter to use the parser to create and run code on the fly. You can also use the O and B modules to dump the opcodes as files (this is sometimes called perl bytecode), or use source filters to manipulate how the parsing takes place, etc.
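Here is a short sketch of two of these tricks. The first half uses a string eval to parse and run code that only exists as text at run time. The second uses B::Deparse, one of the B family of modules shipped with perl, to turn code that has already been compiled back into Perl source, so you can peek at what perl made of it. Exactly what the deparsed text looks like varies between perl versions, so I'm not going to promise you the output:

```perl
use strict;
use warnings;
use B::Deparse;

# 1. eval: hand a string to the parser at run time, then execute the result.
my $source = '(2+2) * 5';
my $result = eval $source;
die "could not compile '$source': $@" if $@;
print "$source = $result\n";      # prints "(2+2) * 5 = 20"

# 2. B::Deparse: regenerate source text from the compiled form of a sub.
my $deparser = B::Deparse->new;
my $text     = $deparser->coderef2text(sub { (2+2) * 5 });
print "$text\n";
```

You will probably find that the deparsed sub contains a plain 20 rather than (2+2) * 5: perl has worked out the constant arithmetic before the code ever runs. To see the opcodes themselves, try perl -MO=Concise -e 'print "Hello world.\n"' at the command line.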

Perl 6

In perl 5, the interpreter, regex engine, parser and so on are incestuously mixed up, and some of the code perl is written in (particularly the regexer) has been optimised to the point of illegibility (apparently, I wouldn't know, as looking at the perl source is known to make people go blind). This has made hacking on perl's internals extremely difficult, and extending perl with the 'XS' language (which allows you to write extensions to perl in C) is an abhorrent nightmare. Hence perl 6. The perl 6 interpreter will no longer be actually called 'perl': it will be called 'Parrot'.

A major aim of the Perl 6/Parrot rewrite is to fix the internal mess of perl 5, modularising it more, so you can mix and match the internal modules.

These changes should make extending perl with C or other languages easier, making embedding it in other programs (like mod_perl in a webserver, or a text editor) easier, and also allowing other interpreted languages to run on the same virtual machine, so that you will be able to call Python library functions from Perl and vice versa.

So that's how perl's innards are due to change. What about Perl 6 the language? Well, it'll be pretty much more of the same, only better. To see exactly what's going on, see the Apocalypses (design documents) and Exegeses (explanations) on the perl 6 website. Some of the more important changes so far announced will be:

Hope that has whetted your appetite for the future of Perl. Keep an eye on perl.com and dev.perl.org for more details.
