Thursday, May 04, 2006

Splitting whitespace

The documentation for Perl is usually quite good, but when it comes to perldoc -f split there is so much magic involved that plain English begins to break down. Take the default arguments, for example.

"If EXPR is omitted, splits the $_ string. If PATTERN is also omitted, splits on whitespace (after skipping any leading whitespace)."

Then, later in the documentation:

"As a special case, specifying a PATTERN of space (' ') will split on white space just as "split" with no arguments does. … A "split" with no arguments really does a "split(' ', $_)" internally."

So, how many whitespace characters is that? A single space as a delimiter suggests that exactly one space is used for the split, but it actually splits on one or more whitespace characters:
   $_ = '    This    is    some    text';
   @a = split;
   $" = '|';
   print "@a\n";
Produces:
   This|is|some|text
So leading whitespace is ignored, and a run of one or more whitespace characters acts as the delimiter. There is a subtle (documented) difference from \s+:
   $_ = '    This    is    some    text';
   @a = split /\s+/;
   $" = '|';
   print "@a\n";
Produces:
   |This|is|some|text
Notice that the first element of the resulting list is empty, which was not previously the case.
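
To put the two behaviours side by side, here is a minimal sketch. The sample string is made up for illustration, and it has trailing whitespace as well as leading whitespace, so it also shows that trailing empty fields are dropped in both cases.

```perl
use strict;
use warnings;

# Hypothetical sample string with leading and trailing whitespace.
my $text = '  a b  ';

my @awk   = split ' ', $text;    # special case: leading whitespace is skipped
my @regex = split /\s+/, $text;  # ordinary regex: leading empty field is kept

print scalar(@awk),   ': ', join('|', @awk),   "\n";   # 2: a|b
print scalar(@regex), ': ', join('|', @regex), "\n";   # 3: |a|b
```

The only difference is the empty first element; neither form returns an empty element for the trailing whitespace.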

A few questions arise from this. First, what is this ' ' syntax all about? Don't we need a regular expression match?
   $_ = 'xxxThisxxisxxsomexxtext';
   @a = split 'x';
   $" = '|';
   print "@a\n";
Produces:
   |||This||is||some||te|t
So it does work, but not exactly like a single space: it does not match 'one or more' (x+), so the space is magic. To be fair, the documentation does say that ' ' is a special case. But the synopsis shows only an RE delimited with / /, not a string literal. In practice split accepts any expression in place of the pattern and interprets its value as a regular expression, so single-quoted and double-quoted strings both work, including multi-character ones (without a leading 'm'). The leading 'm' is needed only when writing a regular expression literal with delimiters other than / /.
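
As a rough check of how a quoted string behaves as a pattern, the sketch below reuses the sample string from above. The variable names are only for illustration. It suggests that the string is treated as a regular expression, so a quantifier inside the quotes is honoured.

```perl
use strict;
use warnings;

my $text = 'xxxThisxxisxxsomexxtext';

my @pair   = split 'xx', $text;    # multi-character string pattern
my @plus   = split 'x+', $text;    # the + inside the string acts as a quantifier
my @braces = split m{x+}, $text;   # delimiters other than / / take a leading m

print join('|', @pair),   "\n";    # |xThis|is|some|text
print join('|', @plus),   "\n";    # |This|is|some|te|t
print join('|', @braces), "\n";    # |This|is|some|te|t
```

Note that the leading empty element is still there in every case; only the magic single space suppresses it.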

Second question: what does whitespace mean? Is ' ' the same as \s in this case? Normally, of course, it is not, but in this case it is: the special ' ' pattern splits on tabs and newlines as well as spaces. ' ' is very special.
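
A small sketch makes the point, using a made-up string that mixes spaces, a tab and a newline. For contrast it also splits on / /, a regular expression containing one literal space, which gets none of the special treatment.

```perl
use strict;
use warnings;

# Hypothetical sample: leading space, a tab, a newline and a double space.
my $text = " one\ttwo\n three  four";

my @special = split ' ', $text;   # special case: any run of whitespace
my @literal = split / /, $text;   # plain regex: exactly one space character

print join('|', @special), "\n";  # one|two|three|four
print scalar(@special), "\n";     # 4
print scalar(@literal), "\n";     # 5
```

With / / the tab and newline stay inside the second field, and the leading space and the double space each produce an empty field, which is where the five fields come from.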