Guts and gore
Perl 5, as the quite venerable version of Perl you've been programming is called, is eventually going to be usurped. For Perl 6 is on its (rather slow) way, and it looks to be very lovely. Before we get onto the changes in the grammar of Perl 6 (capital P for the Perl language), it's worth a quick look under the hood of perl (little p for perl the perl interpreter), to see what actually happens when you type:
print "Hello world.\n";
Perl, like the other so-called scripting languages such as Python, Ruby and PHP (and even Java for that matter), runs on what is termed a virtual machine (VM) or interpreter. Programming languages running on a VM are therefore called interpreted languages, and are slightly distinct from properly compiled languages, such as C and C++. Now, if you've ever programmed in C, or have run a Windows application, you may know what a compiled program is. Or in the latter case, maybe not. Let's start at the bottom.
The only thing your computer's CPU understands is how to shift 0s
and 1s about from one place to another. The only way to get the computer to do
this is to feed it machine code. Machine code consists of binary
(001010101011101010…) instructions that the CPU of
your computer can understand. One level above machine code is
assembly code, which is written in assembly languages.
Assembly language is a thin wrapper over the rather indecipherable
machine code, but it is readily boiled down to machine code by a program
called an assembler. Now, programming in assembly language is
the most basic, and fastest, way of getting things done (for the machine
at least), but not exactly the easiest. Here is the hello world program
for an x86 CPU running DOS:
title Hello World Program (hello.asm)
; This program displays "Hello, World!"
dosseg
.model small
.stack 100h
.data
hello_message db 'Hello, World!',0dh,0ah,'$'
.code
main proc
    mov ax,@data
    mov ds,ax
    mov ah,9
    mov dx,offset hello_message
    int 21h
    mov ax,4C00h
    int 21h
main endp
end main
One level above assembly languages are compiled languages like C or C++. These languages allow you to write code that addresses higher level actions, like reading and writing to files, without having to worry about exactly what this means in terms of pushing 0s and 1s about in the memory, CPU and bus (although you're still programming at a level where you can readily see this if you want). Compiled languages are compiled by (surprise!) a program called a compiler, into assembly code. Say you have this C program:
#include <stdio.h>
int main ( )
{
    printf("Hello, world.\n");
    return 0;
}
This does largely the same thing as the following Perl program (which I've mangled to make it look as similar as possible):
# No need to 'use' STDIN and STDOUT functions, as perl already has them inbuilt
main ();
sub main
{
    print("Hello, world.\n");
    return( 0 );
}
See the
Hello, world program in n different
languages.
To turn the C code into an executable program, you run it through several stages, largely consisting of turning C into assembler and thence to machine code, with a few extras thrown in for good measure.
The first thing that happens is preprocessing, where
directives such as #include <stdio.h> (which is
vaguely similar to use MODULE; in Perl) are processed. The
result of this is then handed to the compiler, which boils down (lexes
and parses) the C code via a parse tree into the
assembly language for the particular computer you are using. The assembly
language is then assembled by the assembler into object code.
Object code is pretty much just a blob of machine code, but a single
executable program might need more than one blob of object code to work,
if you have included any standard library functions (just like you need a
few modules to run the average Perl script). Finally, several object
files (possibly including some derived from standard libraries, such as
stdio) can be linked together by a linker
to form a big blob of real, executable machine code. You can store these
compiled, assembled, linked lumps of machine code, and give them the
grand title of programs, and indeed this is just what most of
the programs you use on your computer are.
C-code → PREPROCESSOR
(which adds in definitions of library functions,
so the compiler knows what the program has to do) →
Processed C-code → COMPILER
(which boils down the fairly machine-independent C-code to a parse tree
and thence to a machine-specific format) →
Assembly code → ASSEMBLER
(which removes the human friendly wrapping from the assembly code) →
Object code → LINKER
(which will grab other bits of object code from libraries
indicated by the preprocessor) →
Machine code → CPU
(which you can use to execute the code, or save it as a binary
executable program)
The main problem with compiled languages is that they have to be compiled down to machine code. Although compiled code runs very fast once compiled to machine code, machine code is highly specific to a particular computer architecture (mostly what the CPU and OS are). Hence you have to recompile such a program for every architecture you want it to work on. You'll likely as not also have to build in some specific functionality in your code so as to get your program to work on more than one architecture. So although compiled C programs are nippy, the binaries they are compiled into are completely unportable.
What interpreted languages do is put a 'skin' (the
interpreter, or virtual machine (VM)), over the top of a computer's
hardware, and you then write programs that run on this 'skin', rather
than on the hardware itself. The interpreter hides the internal detail of
how print "hello, world" is actually achieved on your
computer from you. Your programs still pass through similar stages to a
compiled program, but rather than being compiled to your computer's CPU,
they are compiled to a software simulation of a CPU, the interpreter's
virtual machine.
Now, the interpreter (in particular, we're talking about perl's
interpreter here) is just a program written in C, that has to be compiled
itself when you install it. Hence the writers of perl itself have to
worry about exactly how print is made to work on different
architectures. However, once it's there, you the writer don't
have to worry: a script written on a Linux box will also run on a Windows
XP laptop, and the same program will also run happily on an iMac under
MacOS 8, despite the fact they have utterly different architectures. perl
has been described as a thin skin over the top of any computer that makes
everything look like UNIX.
Now there are a few problems with this. One is that some architectures
(like Windows running over Intel) don't allow certain things to work
properly, like fork()ing,
although there has been progress by implementing fork using threads.
So, although it is largely true that a Perl program will run on
any perl interpreter anywhere, regardless of the actual computer
underneath the hood, there are certain exceptions, most importantly on
Windows and Mac OS < X. Perl programs are therefore largely
'portable'.
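If your script does need fork, it is worth writing the call defensively, so that it fails politely on a perl where fork isn't available. The following is only a minimal sketch: run_in_child is a made-up helper name, not anything built into Perl, and it assumes the child only needs to hand back a small exit status.

```perl
use strict;
use warnings;

# Hypothetical helper: run a code reference in a child process and return
# the child's exit status, or undef if fork() is unavailable or fails.
sub run_in_child {
    my ($code) = @_;

    # fork() dies on a perl built without it, and returns undef if the
    # fork itself fails, so trap both cases here.
    my $pid = eval { fork() };
    return unless defined $pid;

    if ($pid == 0) {
        # Child: run the code and use its return value as the exit status.
        exit $code->();
    }

    # Parent: wait for the child, then unpack its exit status from $?.
    waitpid($pid, 0);
    return $? >> 8;
}

my $status = run_in_child(sub { return 3 });
print defined $status
    ? "child exited with status $status\n"
    : "fork is not available here\n";
```

The point is the check on the return value of fork: a script that assumes fork always works is exactly the sort that stops being portable.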
The other problem with interpreted languages is they can run quite slowly: the program has to be compiled onto the virtual machine then run every time you invoke it. In contrast, compiled programs only need to be compiled to the real machine once. Java (which runs on the Java VM) gets round this by storing the compiled program as bytecode, which is effectively machine code for the Java virtual machine. However, as well as the compilation phase, there is also some overhead in actually starting up the perl virtual machine in the first place (a problem which Java doesn't g…e…t…ar……o……u………nd in my experience).
However, Perl, Python, Java and other interpreted languages are still popular, since they have many high level functions that would be a serious pain in the arse to program in C or C++ from first principles (I dare you to implement a sorting algorithm in C, or even in Perl without using sort or any array functions. It's not that difficult, but it's not fun either). They also take care of their own memory management: in many compiled languages (e.g. C), you have to worry about ensuring you have enough space to create variables, and if not, you have to ask the operating system for more.
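In case you want to take up the dare, here is one way it might go: a sketch of an insertion sort for numbers, where the sorting itself uses nothing but array indexing and comparisons (join is only there to print the result). The name insertion_sort is mine, and this is one of many possible algorithms, not a recommendation.

```perl
use strict;
use warnings;

# Insertion sort for a list of numbers, using only indexing and comparison:
# no sort(), push(), splice() or other list builtins in the sorting itself.
sub insertion_sort {
    my @items = @_;    # work on a copy, leaving the caller's list alone
    for my $i (1 .. $#items) {
        my $current = $items[$i];
        my $j       = $i - 1;
        # Shift larger elements one slot right to open a gap for $current.
        while ($j >= 0 && $items[$j] > $current) {
            $items[$j + 1] = $items[$j];
            $j--;
        }
        $items[$j + 1] = $current;
    }
    return @items;
}

print join(' ', insertion_sort(5, 2, 9, 1, 5, 6)), "\n";   # prints "1 2 5 5 6 9"
```

Compare that with sort { $a <=> $b } @numbers and you can see what the high level functions are saving you.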
OK, now into the nitty gritty ('what, wasn't that just the nitty gritty?'). Perl programs actually go through similar stages to a compiled program when they are run on the perl interpreter. Remember an interpreter is a virtual machine: you can think of perl as a simulation of a computer running on your computer if you like (which would please Alan Turing immensely).
The first thing that happens to a Perl program when you invoke it is that perl's interpreter lexes and parses your program into an internal format called a parse tree, just as the C-compiler did. For example:
(2+2) *5
will be lexed (broken into recognisable 'tokens') into something like:
( 2 + 2 ) * 5
The parser will then construct a parse tree that looks something like:
      /---- 5
--*--|       /---- 2
      \--+--|
             \---- 2
Which I hope makes sense: it tells perl it'll need to add (+) 2 and 2
together, then take that result, and multiply it by the number 5, then
take that result and return it. The syntax tree is then compiled (to the
VM, not to the real machine), to form bytecode, which as I've said, is
basically assembly code for the virtual machine. An optimiser then tarts
the bytecode up, before sending it to the interpreter proper, which
actually executes the code. That's why it's called an interpreter: it
interprets virtual machine code into real executable code that the
computer can understand. [NB. This is strictly a lie. The perl interpreter
actually executes perl opcodes (which are C structs) not
bytecode, but the difference between bytecode and opcodes needn't worry
you too much ☺ ].
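To make the lexing and parsing stages above a bit more concrete, here is a toy version written in Perl itself. It is only a sketch of the general idea, for integer arithmetic with brackets: perl's real lexer and parser are written in C and are vastly more involved, and all the sub names here (lex, parse_expr, evaluate and so on) are invented for this example.

```perl
use strict;
use warnings;

# Lexer: break the source text into number, operator and bracket tokens.
sub lex {
    my ($source) = @_;
    my @tokens;
    while ($source =~ /\G\s*(\d+|[-+*\/()])/gc) {
        push @tokens, $1;
    }
    $source =~ /\G\s*\z/gc
        or die "Unexpected character at offset " . (pos($source) // 0) . "\n";
    return @tokens;
}

# Parser: turn the tokens into a tree of [operator, left, right] nodes.
#   expr   := term   (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | '(' expr ')'
sub parse_expr {
    my ($tokens) = @_;
    my $node = parse_term($tokens);
    while (@$tokens && $tokens->[0] =~ /^[-+]$/) {
        my $op = shift @$tokens;
        $node = [ $op, $node, parse_term($tokens) ];
    }
    return $node;
}

sub parse_term {
    my ($tokens) = @_;
    my $node = parse_factor($tokens);
    while (@$tokens && $tokens->[0] =~ /^[*\/]$/) {
        my $op = shift @$tokens;
        $node = [ $op, $node, parse_factor($tokens) ];
    }
    return $node;
}

sub parse_factor {
    my ($tokens) = @_;
    my $token = shift @$tokens;
    die "Unexpected end of input\n" unless defined $token;
    return $token if $token =~ /^\d+$/;
    die "Unexpected token '$token'\n" unless $token eq '(';
    my $node  = parse_expr($tokens);
    my $close = shift @$tokens;
    die "Missing ')'\n" unless defined $close && $close eq ')';
    return $node;
}

# 'Interpreter': walk the tree, evaluating the branches before the operator.
sub evaluate {
    my ($node) = @_;
    return $node unless ref $node;    # a leaf is just a number
    my ($op, $left, $right) = @$node;
    my ($l, $r) = (evaluate($left), evaluate($right));
    return $op eq '+' ? $l + $r
         : $op eq '-' ? $l - $r
         : $op eq '*' ? $l * $r
         :              $l / $r;
}

# Glue the stages together: source -> tokens -> tree -> result.
sub calculate {
    my ($source) = @_;
    my @tokens = lex($source);
    my $tree   = parse_expr(\@tokens);
    die "Unexpected token '$tokens[0]'\n" if @tokens;
    return evaluate($tree);
}

print join(' ', lex('(2+2) *5')), "\n";   # prints "( 2 + 2 ) * 5"
print calculate('(2+2) *5'), "\n";        # prints 20
```

For (2+2) *5 the parser builds [ '*', [ '+', 2, 2 ], 5 ], which is the same tree as the one drawn above. Note that this toy walks the tree directly: it skips the compile-to-opcodes and optimise steps that perl itself goes through.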
The process largely looks like this:
Perl code → PARSER
(which tokenises and parses your code into an internal
format called a parse tree) →
Parse tree → COMPILER
(which converts the parse tree into opcodes
for the perl VM) →
Opcodes → OPTIMISER
(which makes the perl opcodes run faster) →
Optimised Opcodes → INTERPRETER
(which converts the opcodes into machine code
instructions for your hardware) →
Machine code → CPU
(which executes the code)
Various weird things can be done to upset this nice linear flow.
perl's innards are quite incestuous, so eval allows the
interpreter to use the parser to create and run code on the fly. You can
also use the O and B modules to dump the
opcodes as files (this is sometimes called perl bytecode), or use source
filters to manipulate how the parsing takes place, etc.
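Two of these tricks are easy to try for yourself. The sketch below uses a string eval to push source code back through the parser at run time, and the B::Deparse module (one of the B family that ships with perl) to turn a subroutine's compiled opcodes back into Perl source. The helper names run_source and show_compiled are made up for this example.

```perl
use strict;
use warnings;
use B::Deparse;

# A string eval hands the text back to perl's parser at run time.
sub run_source {
    my ($source) = @_;
    my $result = eval $source;
    die "Could not run '$source': $@" if $@;
    return $result;
}

# B::Deparse walks the compiled opcodes of a subroutine and regenerates
# Perl source from them, so you can see what the compiler left behind.
sub show_compiled {
    my ($code_ref) = @_;
    return B::Deparse->new->coderef2text($code_ref);
}

print run_source('(2+2) * 5'), "\n";    # prints 20
print show_compiled(sub { return (2+2) * 5 }), "\n";
```

The deparsed subroutine should contain a plain 20 rather than the original sum: the arithmetic on constants has already been done by the time the opcodes exist, which is the optimiser stage at work. For a look at the opcodes themselves, perl -MO=Concise followed by your script name will list them.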
Perl 6
In perl 5, the interpreter, regex engine, parser and so on are incestuously mixed up, and some of the code perl is written in (particularly the regexer) has been optimised to the point of illegibility (apparently, I wouldn't know, as looking at the perl source is known to make people go blind). This has made hacking on perl's internals extremely difficult, and extending perl with the 'XS' language (which allows you to write extensions to perl in C) is an abhorrent nightmare. Hence perl 6. The perl 6 interpreter will no longer be actually called 'perl': it will be called 'Parrot'.
A major aim of the Perl 6/Parrot rewrite is to fix the internal mess of perl 5, modularising it more, so you can mix and match the internal modules:
- You could have a Python parser/compiler that generates a parse tree from Python code, that can then be fed to the Parrot interpreter.
- You could have a compiler that generates Java bytecode rather than Parrot bytecode from a Perl script.
- You could have an interpreter that spits out an executable binary rather than running the code directly.
- You could avoid using the optimiser for quick throwaway scripts.
- You could have a compiler that spits out Parrot bytecode, to save for later, rather than running it directly.
- And so on.
These changes should make extending perl with C or other languages easier, make embedding it in other programs (like mod_perl in a webserver, or a text editor) easier, and also allow other interpreted languages to run on the same virtual machine, so that you will be able to call Python library functions from Perl and vice versa.
So that's how perl's innards are due to change. What about Perl 6 the language? Well, it'll be pretty much more of the same, only better. To see exactly what's going on, see the Apocalypses (design documents) and Exegeses (explanations) on the perl 6 website. Some of the more important changes so far announced will be:
- Perl's guts will be much more object-oriented, as will things like IO, which is currently rather ugly.
- The introduction of the %MY:: symbol table, in which you can find the current scope's lexical variables.
- Easier and separated module and class construction.
- Spring cleaning of the punctuation variables: $! becomes the one true error variable (to subsume $!, $?, $^E and $@).
- Simplifying changes to arrays, hashes and references, so @array[ $index ] DWIMs, and references automatically dereference themselves in certain situations. $, % and @ become invariable, so slices and accesses of arrays and hashes 'look' right.
- Better multidimensional arrays and new array torturing operators, such as those in List::Util.
- Nicer heredocs (indentation is ignored intelligently).
- open will return a filehandle, so you can write my $fh = open "D:/file.txt";
- You can specify attributes/traits on variables: if you're so minded, you can go the whole bondage and discipline route and type all your variables: my int $pi is constant = 3.14;
- Turbo-charged subroutine prototypes (so you don't have to start them all with my ( $arg1, $arg2 ) = @_; ) and multimethods (several subroutines with the same name that get called depending on what the types of the arguments are).
- Removal of warts like system() returning FALSE when it works.
- ? : becomes ?? :: and method calling -> becomes . (a dot). Concatenating . becomes something else, and the hyperoperator <<blah>> is introduced, so you can write @all_multiplied_by_2 = @numbers <<*>> 2;
- Define your own operators, even ones with Unicode symbols (so ∑ for sum), and overload or override almost any builtin operator, iterator or function.
- local becomes temp.
- for loops can take more than one variable at a time, so you can process n arrays m elements at a time.
- Hashes become more powerful, and the fat comma => becomes a pair constructor.
- A real Switch statement, given $arg { when EXPR { DO_SOMETHING; } }, with the DWIM comparison operator ~~, which will check if a scalar appears in an array, is the key of a hash, or matches a regex, all with the same syntax.
- All blocks become closures, and you can define lexical (my) subroutines, private class methods, and similar such.
- The regexer is transformed into something much, much more powerful, along the lines of yacc or Parse::RecDescent.
- And lots more…
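If you haven't been bitten by the system() wart or the scattered error variables mentioned in the list above, here is what they look like in Perl 5 as it stands. This is only an illustrative sketch: run_ok is a made-up wrapper, /no/such/file is a path assumed not to exist, and $^X (the path of the perl that is currently running) is used so there is something safe to run.

```perl
use strict;
use warnings;

# Hypothetical wrapper giving system() the 'true means success' behaviour
# you probably expected in the first place.
sub run_ok {
    my @command = @_;
    my $status  = system(@command);    # 0 on success, so FALSE when it works
    return $status == 0;
}

print "exit 0 counts as success\n" if     run_ok($^X, '-e', 'exit 0');
print "exit 1 counts as failure\n" unless run_ok($^X, '-e', 'exit 1');

# Three different kinds of failure, reported in three different variables.
# The path below is assumed not to exist.
open(my $fh, '<', '/no/such/file') or print "open failed, see \$!: $!\n";
eval { die "oops\n" };
print "eval failed, see \$\@: $@";
system($^X, '-e', 'exit 2');
print "child exit status, from \$?: ", $? >> 8, "\n";
```

Having one true error variable and a system() that returns something true on success would tidy up most of this.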
Hope that has whetted your appetite for the future of Perl. Keep an eye on perl.com and dev.perl.org for more details.
