Friday, May 27, 2005

Unix for Perl Programmers


This is an introduction to using the Unix API from Perl.

It concentrates on POSIX-compliant system calls and introduces System V Release 4 Unix. I look at the fundamental concepts of a process and its interaction with the Unix kernel, discuss the difference between library calls and system calls, and give an overview of error handling. Later postings will look at API calls in detail.

I have assumed a degree of familiarity with Perl that can be obtained by attending the QA Perl Programming course.


Unix compatibility

Unix has a long and chequered history, with a number of standards which often conflict. An attempt to standardise application programming interfaces as POSIX.1 (IEEE Std 1003.1, now part of the X/Open Single UNIX Specification) has not been universally adopted.
Use defensive programming techniques where possible; for example, do not hard-code system limits, but call functions at run time to find out what they are (see later).
As Perl programmers we are protected from many of the differences between operating systems, but we are not immune from them. Most Perl statements are portable, but the lower we delve into the innards of Unix, the less portable we become. The differences are often expressed in C terms, even in the Perl documentation, so a knowledge of that language is very beneficial.
Use the built-in Perl variable $^O to mark blocks of code you know are specific to certain versions of Unix.
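As a minimal sketch of that advice, the fragment below branches on $^O. The helper name platform_family and the groupings it returns are hypothetical, chosen only for illustration. The $^O strings are the ones I would expect on those systems, but check your own.

```perl
use strict;
use warnings;

# Hypothetical helper: map a $^O string onto a broad Unix family.
sub platform_family {
    my ($os) = @_;
    return 'linux'   if $os eq 'linux';
    return 'solaris' if $os eq 'solaris';
    return 'bsd'     if $os =~ /bsd\z/ or $os eq 'darwin';
    return 'other';
}

my $family = platform_family($^O);
if ($family eq 'linux') {
    # Linux-only code would go here
}
print "Running on $^O (family: $family)\n";
```

Keeping the test in one small function means the platform-specific blocks are easy to find later.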
In this article I often give example code and function prototypes. Arguments and their types have been described in a generic way where possible; however, you should not assume that they exactly match the systems you use, or your version of Perl.


Perl with Unix: call interfaces



The kernel is the heart of a Unix system. Its job is primarily one of resource management; it looks after the physical resources (hardware) of the computer system and shares them amongst all of the processes that wish to use them.
The kernel can be thought of as providing a number of services to programs, such as a file system for data and program storage, management of the RAM in the computer and a means of accessing devices attached to the computer. The details of the actual hardware are hidden from programs by the kernel; it presents a "virtual-machine" interface to the programs. The virtual-machine interface is consistent across almost all versions of Unix, regardless of hardware details.
Programmers will occasionally require the kernel to perform some function for them, such as performing input or output, or starting a new process. Such requests are made through a number of well-known special function calls known as system calls. The set of system calls defines the interface to the kernel for programs and is kept consistent as much as possible across different variants of Unix. Nevertheless, such low-level access is by definition dependent on the operating system, and often requires detailed knowledge. To enable portability, and to supply easy-to-use interfaces, a higher level language like C provides its own RunTime Library (RTL). This takes an ISO standard call and converts it into an operating system specific call, and maybe adds other features as well. Other languages and products have their own RTLs, for example an RDBMS will have a RunTime Library to enable embedded SQL to run.


Perl is written in C, so it uses the C RTL, and many of the built-in function calls reflect that. Sometimes these are too general, so we have available lower-level interfaces to make direct calls to kernel APIs, but remember that these were designed to be called from C, not Perl. Perl modules often make use of low-level calls, although they sometimes "cheat" by using embedded C.


System Calls and Library Functions


Most Unix programs will need to make use of one or more library functions or system calls. While they may both appear to be the same from the program’s point of view, there are some important differences between library functions and system calls.

System calls are the means by which a process can request services from the kernel. Services include accessing and manipulating files and devices, and process-related operations. System calls are documented in section 2 of the Unix reference manual, and are platform specific. Some Perl built-ins make system calls, and appear to be the same as the kernel call. The read() function is an example. The Perl version will make a call to the C library function fread(3) eventually, but along the way adds value. It can handle utf8 files, for example. If we want the kernel function read(2) we need to call the Perl function sysread (there is also a syswrite and a sysopen). Here are some examples:

Perl       Kernel call   Library function

$$         getpid
fork       fork
open                     fopen
read                     fread
sysopen    open
sysread    read
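To make the distinction concrete, here is a small sketch that reads the same bytes twice, once through the buffered built-ins and once through the sys* calls. The temporary file and its contents are my own invention for the example.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY);
use File::Temp qw(tempfile);

# Create a scratch file to read back (removed automatically at exit)
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "hello kernel\n";
close $tmp_fh or die "close: $!";

# Buffered route: open and read go through Perl's I/O layers
open(my $in, '<', $tmp_name) or die "open: $!";
my $got = read($in, my $buffered, 5);
close $in;

# Unbuffered route: sysopen and sysread map onto open(2) and read(2)
sysopen(my $raw, $tmp_name, O_RDONLY) or die "sysopen: $!";
my $n = sysread($raw, my $unbuffered, 5);
close $raw;

print "read gave '$buffered', sysread gave '$unbuffered'\n";
```

Do not mix the two styles on the same file handle, since the buffered calls may already have read ahead of where sysread thinks it is.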


Library functions are simply C functions written by someone else and supplied via libraries of functions. These libraries are linked to perl either statically (before the program runs) or dynamically (at run time). Using a library function does not necessarily require any help from the kernel.
C Runtime Library functions are documented in section 3 of the Unix reference manual, but Perl does not necessarily use them. The Library functions themselves can have portability problems, so Perl often implements its own. See perldoc perlclib.

The major portion of the work carried out by the various bodies that have developed standards for Unix, such as the IEEE (POSIX) and the Open Group (X/Open Portability Guide, Spec 1170, etc.), has been to standardise the names, arguments and results of library functions and system calls. The result is that the various versions of Unix that exist today should all contain a compatible set of these, forming a standard API supporting the development of applications that can be ported with minimal effort.


Perl system interfaces

There are a couple of different methods that may be used to embed C calls into Perl. The first is the syscall interface. This allows a call to be made directly to a kernel system call, identified by its number; the numbers are defined in a C header file. The h2ph utility will convert the header file to a Perl .ph file, which can then be loaded with require. Unfortunately the h2ph utility cannot cope with all structures and calls, so sometimes the resulting .ph files have to be "tweaked" manually to get them to compile. The most difficult part of using syscall is in converting the variable types between Perl and C, using pack and unpack. A detailed knowledge of C is required for this, as well as an understanding of the function you are about to call.
The XS interface is much easier to use, but still needs an understanding of C. Again we convert the header file, but this time using h2xs. We embed C code in an XS code block, then build a new module using the XS compiler (called xsubpp). This time the conversion between C and Perl variables is done automatically.
There is a simple XS tutorial available in perldoc, called perlxstut, and the full XS documentation is in perlxs.
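A minimal sketch of the syscall interface follows. It assumes that h2ph has been run so that syscall.ph is installed, which is often not the case, so the require is wrapped in an eval and the script falls back to $$ if the file is missing. I picked getpid because it takes no arguments and so needs no pack or unpack.

```perl
use strict;
use warnings;

# syscall.ph is generated by h2ph and is not installed everywhere,
# so load it defensively.
my $have_ph = eval { require 'syscall.ph'; 1 };

my $pid_via_syscall;
if ($have_ph && defined &SYS_getpid) {
    # SYS_getpid() yields the system call number for getpid(2)
    $pid_via_syscall = syscall(SYS_getpid());
    print "getpid via syscall: $pid_via_syscall, \$\$ is $$\n";
}
else {
    print "syscall.ph is not available here; \$\$ is $$\n";
}
```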


Converting between Perl and C

Perl variables are held in an "internal" format, and do not usually have a 1:1 mapping with C primitive types. However, many low-level functions make C system calls, and so we often need to supply a C struct from Perl.

We convert from Perl to C using the built-in function pack:

C scalar = pack TEMPLATE, LIST

where TEMPLATE indicates the type of each field (see below).

To convert from C to Perl we (unsurprisingly) use unpack:

Perl list = unpack TEMPLATE, SCALAR


There are over 30 different template characters; some common ones are:

A    A text (ASCII) string, will be space padded.
Z    A null terminated (ASCIZ) string, will be null padded.
c    A signed char value.
C    An unsigned char value. Only does bytes. See U for Unicode.
s    A signed short value.
S    An unsigned short value (16 bits).
i    A signed integer value.
I    An unsigned integer value.
l    A signed long value.
L    An unsigned long value (32 bits).
f    A single-precision float in the native format.
d    A double-precision float in the native format.
D    A long double-precision float in the native format.
x    A null byte.

For a full list see perldoc -f pack.
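Here is a small round trip as a sketch of how a template lines up with the data. The record layout (an 8-byte name, a short and a long) is invented for the example and does not correspond to any real C struct.

```perl
use strict;
use warnings;

# Perl to C: an 8-byte space-padded string, a 16-bit short, a 32-bit long
my $record = pack 'A8 s l', 'perl', -2, 100_000;
my $size   = length $record;    # pack adds no alignment padding of its own

# C to Perl: the same template reverses the conversion
my ($name, $short, $long) = unpack 'A8 s l', $record;

print "record is $size bytes: name '$name', short $short, long $long\n";
```

Note that unpack with A strips the trailing padding, and that a C compiler may insert alignment padding that you must allow for yourself (the x template is useful for that).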

The following example packs the flock structure needed to request a file lock with fcntl:

use Fcntl qw(:DEFAULT :seek); # F_WRLCK and SEEK_SET
$wlock = pack 'sslli',(F_WRLCK, SEEK_SET, $rec, $rsize, 0);
fcntl( ... ) # Some C call

Unpack the returned flock structure:

($locked, $pid) = (unpack 'sslli', $qlock)[2,4];

The above examples show an interesting portability issue: they do not work on Linux 2.4. This is because the Linux implementation of the perl fcntl function does not call fcntl(2), but a lower level function, fcntl64. This means that the normal 'l', which is 32 bits, has to be doubled up:

my $wlock = pack 'sslllli', (F_WRLCK, SEEK_SET, 0,
$rec, 0, $rsize, 0);

(It took me ages to figure out that Linux perl uses fcntl64. I eventually ran strace(1) on my perl script to discover why it would not work.)


POSIX module

The POSIX module is an important tool for Perl programmers on Unix. The interfaces it provides are mostly portable, although you must still be aware of the limitations of your particular system. The POSIX 1 standard (there are actually 13 parts to POSIX) defines a C language interface, and so some functions are not applicable to Perl. Perl abstracts messy tasks like dynamic memory allocation, so strictly speaking the POSIX module does not, and cannot, provide true POSIX compliance, only POSIX functionality. Being a C interface does mean that knowledge of C is a definite advantage when understanding the documentation.
Some POSIX functionality is provided by standard built-in Perl functions and variables, and it is generally better to use those rather than the POSIX module versions, since Perl is optimised with them, but there is not a huge difference.
To use the POSIX module effectively requires some C knowledge, but you do not need to be an expert.

Later postings will explore the POSIX module further.


Error Handling


Always check a function call for errors. As a rule, a return value of FALSE (0 or undef) indicates that an error occurred that prevented the system call from completing successfully. The exact details of the return values of the system calls are found on the relevant manual pages. Return values should always be checked, since undetected error conditions may cause serious problems that are more difficult to locate later in a program.

Some low-level functions can legitimately return zero, which must not be mistaken for an error. For example the sysseek function returns the new file position, which could be zero (beginning of file). In this case the text value '0 but true' is returned instead of just zero. This evaluates to TRUE in Boolean context but 0 in numeric context, and is exempt from the usual warnings about use of non-numeric text.
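A minimal sketch of the '0 but true' convention, using sysseek, which perldoc documents as returning that string for position zero. The scratch file is only there to give us a handle to seek on.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);
use File::Temp qw(tempfile);

my ($fh, $name) = tempfile(UNLINK => 1);

# Seek to the start of the file: position zero, but not a failure
my $pos = sysseek($fh, 0, SEEK_SET);
die "sysseek failed: $!" unless $pos;    # '0 but true' passes this test

my $numeric = $pos + 0;                  # no "isn't numeric" warning here
print "sysseek returned '$pos', which is $numeric as a number\n";
close $fh;
```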
By convention error messages should be reported to STDERR, using any of the methods shown below.

print STDERR "Oops: $!\n"; # not fatal, not trappable

die "A death: $!"; # fatal, but trappable

warn "Will Robinson!"; # not fatal, trappable

The warn and die functions append the perl script name and line number when the message does not end in a newline, and both calls may be trapped using signal handling (the __WARN__ and __DIE__ entries in %SIG). The usual way to trap die is by using an eval block (exception handling).
If a system call fails, the global variable $! will contain a system-defined error number, in numeric context, and the error text in string context. The variable $! is only set when an error occurs. It is not cleared on a successful system call. It is only safe to rely on this value when a system call has failed.
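The following sketch pulls these points together: a warning trapped through %SIG, a die trapped with eval, and $! read in both numeric and string context. The path being opened is hypothetical and assumed not to exist on your system.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them reach STDERR
my @trapped;
$SIG{__WARN__} = sub { push @trapped, $_[0] };
warn "Will Robinson!";

# Trap a fatal error with an eval block; the message is left in $@
my $ok    = eval { die "A death\n"; 1 };
my $error = $@;

# $! is a number in numeric context and text in string context
my ($errno, $errtext) = (0, '');
unless (open(my $fh, '<', '/no/such/dir/no-such-file')) {   # assumed missing
    ($errno, $errtext) = ($! + 0, "$!");
}

print "trapped warning: $trapped[0]";
print "trapped die: $error";
print "open failed with errno $errno ($errtext)\n";
```

Copy $! straight after the failing call, as here, because a later call may overwrite it.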


Errno module

Unix recognises a standard set of error codes, but not all system calls can return all errors. Which errors can be returned by which functions is described in the manual pages.

Some useful error codes and constants (conditionally exported by Errno.pm):

EPERM Operation not permitted
ENOENT No such file or directory
ESRCH No such process
EIO I/O error
ENOEXEC Exec format error
ECHILD No child processes
EACCES Permission denied
EEXIST File already exists
ENOTDIR Not a directory


Note that some errors such as EAGAIN and EINTR may not be errors but indicate that a system call must be attempted again (EAGAIN because conditions were not right for the call to complete and EINTR because the call was interrupted by some other system activity).
The package Errno defines the symbolic constants that represent system error conditions. It also defines the hash %!, which has a key for each error constant, and its value is TRUE if the error has occurred. Not all error constants are portable, so you should not assume any particular code exists in %!.


Example - Error Handling

use Errno; # No need to import anything

my $file = 'input.txt';

if ( !open (HANDLE, '<', $file) )
{
    if ($!{ENOENT})
    {
        print STDERR "$file does not exist\n";
        ...
    }
    elsif ($!{EACCES})
    {
        print STDERR "$file: Permission denied\n";
        ...
    }
    else
    {
        die "Unable to open $file: $!";
    }
}

This example shows how Errno can be used to handle different error conditions.
If the call failed for any other reason, the program prints an error message and terminates with die. This is safer than exit, particularly in a module, since it may be trapped in the calling routine.
Errno can also import POSIX codes with:
use Errno ':POSIX';
An alternative to the Errno module is the POSIX module:
use POSIX ':errno_h';
which exports the error number constants, but not %!.
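As a sketch of the POSIX alternative, the constants are compared numerically against $! rather than looked up in %!. The file name missing.txt is hypothetical; it sits in a freshly created empty directory so that the open is certain to fail.

```perl
use strict;
use warnings;
use POSIX qw(:errno_h);
use File::Temp qw(tempdir);

my $dir     = tempdir(CLEANUP => 1);
my $missing = "$dir/missing.txt";     # hypothetical name, never created

my $reason = 'opened';
if (!open(my $fh, '<', $missing)) {
    # Compare $! in numeric context with the imported constants
    if    ($! == ENOENT) { $reason = 'does not exist' }
    elsif ($! == EACCES) { $reason = 'permission denied' }
    else                 { $reason = "failed: $!" }
}
print "$missing: $reason\n";
```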


Command Line Arguments

When a program is executed, it is passed a variable-length list of arguments. The shell arranges for these arguments to be the command name and arguments typed at the command line.
The argument list is accessible to the program by the array @ARGV, so the number of arguments passed can be obtained by using the array in scalar context. Unlike C, the first element is the first argument, not the program name (that is available from $0).
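There is a system-defined maximum size of the argument list. POSIX only guarantees 4096 bytes (_POSIX_ARG_MAX); older systems commonly allowed 5KB or 10KB, and many current systems allow far more. The actual limit can be queried at run time with sysconf(_SC_ARG_MAX).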
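A short sketch of the points above: counting the arguments, finding the program name, and asking the system for its argument list limit rather than assuming one.

```perl
use strict;
use warnings;
use POSIX qw(sysconf _SC_ARG_MAX);

my $argc    = scalar @ARGV;            # program name is NOT included
my $arg_max = sysconf(_SC_ARG_MAX);    # undef if it cannot be determined

my $limit_text = defined $arg_max ? "$arg_max bytes" : 'undetermined';

print "$0 was given $argc argument(s)\n";
print "Argument list limit: $limit_text\n";
```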


Option Processing


In the past, Unix commands were written by different contributors, often employing slightly different methods of dealing with the command-line arguments of their programs. By convention, Unix program arguments are either options, which are usually single letters preceded by the "-" character, or arguments to the command itself, usually pathnames. In recent versions of Unix, a definition has been developed of a "standard" command interface, which defines a basic format for command lines. New programs should, if at all possible, be written to conform to this format.
The format is fairly flexible, but does impose a few more restrictions:
Options must be a single character.
All options and their associated arguments must appear before the main arguments.
Multiple option characters may be grouped behind a single "-", except where an option requires an argument.
The standard module Getopt::Std offers two routines, getopt and getopts, the latter being the more useful. getopts reads the command line argument list array @ARGV, and extracts valid switches into side-effect variables, or into a predefined hash. If an option is to have an associated argument, the option letter must be followed by a colon (:) character in the specification string passed to getopts.
Valid options are removed from @ARGV. The end of the options is signified by encountering a non-option argument (i.e. one which does not start with a "-") or the "special" option "--", which is also removed from @ARGV.
If getopts encounters an invalid option then it prints a message on the standard error stream and returns FALSE.
Further features are available in the extended version Getopt::Long.
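For comparison, here is a minimal Getopt::Long sketch. The option names verbose and dir, and the sample argument list, are invented for the example. I have used GetOptionsFromArray so the sample does not disturb the real @ARGV.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# A made-up command line to parse
my @args = ('--verbose', '--dir=/tmp', 'file.txt');

my ($verbose, $dir) = (0, '.');
GetOptionsFromArray(
    \@args,
    'verbose' => \$verbose,    # simple flag
    'dir=s'   => \$dir,        # option taking a string value
) or die "Bad options\n";

print "verbose: $verbose, dir: $dir, remaining: @args\n";
```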


Examples - Options and Arguments

These two code fragments show examples of both methods of syntax.

Using global variables

use strict;
use Getopt::Std;

our ($opt_a, $opt_d, $opt_l);

getopts('ad:l');
print "a: $opt_a, d: $opt_d, l: $opt_l\n";
print "Remaining arguments: @ARGV\n\n";


Using a hash

use strict;
use Getopt::Std;

my %options;

getopts('ad:l', \%options);

while (my ($key, $value) = each %options)
{
    print "Switch: $key, value: $value\n";
}

print "Remaining arguments: @ARGV\n";


When using globals with use strict, we must pre-declare them using our. When using a hash we pass a reference.
In both cases the value is 1, TRUE, if the switch was set. When a value is required it will be set in the side effect variable or the hash.
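To see the rules in action, this sketch feeds getopts a simulated command line with grouped options, an option argument and the "--" terminator. The argument values are invented for the example.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line: -a and -l grouped, -d with an argument,
# then "--" to end the options
@ARGV = ('-al', '-d', '/tmp', '--', '-x', 'file.txt');

my %options;
getopts('ad:l', \%options) or die "Bad options\n";

for my $key (sort keys %options) {
    print "Switch: $key, value: $options{$key}\n";
}
print "Remaining arguments: @ARGV\n";
```

The -x after "--" is left alone in @ARGV, even though it looks like an option.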


The Environment block

Every process has access to a variable-length list of pointers to strings, known as environment variables.
Environment variables can be set up from the shell with the export or setenv (csh) commands. They are useful for communicating information on a global level to multiple commands, as they can be accessed easily from within programs and shell scripts.
To see the current environment from the shell, use the env command.
The entire environment is made available to Perl through the hash variable %ENV. The keys are the environment variable names, and the values are their settings.

print "Home directory is ",
exists($ENV{'HOME'})?$ENV{'HOME'}:"unknown", "\n";

TMTOWTDI

if ( exists $ENV{'HOME'} )
{
    print "Home directory is $ENV{'HOME'}\n";
}
else
{
    print "Home directory is unknown\n";
}


The example retrieves the value of the HOME environment variable (i.e. the pathname of the user’s home directory).
If HOME is set in the environment, its value is returned; otherwise the key will not exist, and using the undefined value will produce a warning when warnings are enabled. Therefore it may be advisable to check with exists first.
Incidentally, the HOME environment variable is the default directory for the built-in chdir function (like a shell cd command). If HOME is not set, then LOGDIR is used.
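%ENV can be written as well as read, and anything placed in it is inherited by child processes. A minimal sketch follows. The variable name DEMO_GREETING is hypothetical, and the child is simply another copy of perl found through $^X.

```perl
use strict;
use warnings;

$ENV{DEMO_GREETING} = 'hello';     # hypothetical variable name

# List-form pipe open runs the child without involving a shell
open(my $child, '-|', $^X, '-e', 'print $ENV{DEMO_GREETING}')
    or die "Cannot start child: $!";
my $seen = do { local $/; <$child> };
close $child or die "Child failed (status $?)";

print "Child saw DEMO_GREETING=$seen\n";
```

Changes made this way affect only the current process and its children, never the parent shell.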



Limits

Since Unix is supported on a wide range of platforms, it is reasonable to assume that there will be differences in detail among them. For example some may allow long filenames where others may still impose the old limit of 14 characters. Some may support sophisticated signal-driven job control while others may not.
In developing Unix programs that are intended to work across a wide range of Unix implementations, it is desirable to be able to represent such properties in a platform independent manner, perhaps discovering details about a particular system at run time.
POSIX defines a number of symbolic constants, available through the POSIX module, that govern the details of data types, their sizes and max/min values.
Run-time values can be checked using one of three functions. sysconf reports general properties such as the number of system clock ticks per second (useful in certain timing calculations), the maximum number of files that a process can hold open at one time, and the maximum number of child processes that a process can create. Properties that are more closely related to files and directories, such as the length of a filename or pathname, are discovered using pathconf or fpathconf.
Many of these “runtime” limits are also mentioned in the POSIX module, but the values here are absolute minimums. For example, a system may set the maximum number of open files per process to be any value greater than _POSIX_OPEN_MAX, usually defined as 16.
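The sketch below queries the three general properties mentioned above and prints the POSIX minimum for open files alongside. The labels are my own wording. Any of the queries may come back undefined on a system that does not support that property or places no fixed limit on it.

```perl
use strict;
use warnings;
use POSIX qw(sysconf _SC_CLK_TCK _SC_OPEN_MAX _SC_CHILD_MAX _POSIX_OPEN_MAX);

my @queries = (
    [ 'clock ticks per second', _SC_CLK_TCK   ],
    [ 'open files per process', _SC_OPEN_MAX  ],
    [ 'child processes',        _SC_CHILD_MAX ],
);

my %limit;
for my $query (@queries) {
    my ($label, $name) = @$query;
    $limit{$label} = sysconf($name);       # undef when not available
    my $shown = defined $limit{$label} ? $limit{$label} : 'undetermined';
    printf "%-24s %s\n", $label, $shown;
}

# The compile-time constant is only the guaranteed minimum
printf "%-24s %d\n", 'POSIX minimum open files', _POSIX_OPEN_MAX;
```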


POSIX::sysconf

General system limits and properties can be queried using the sysconf function. The argument is an integer symbolic constant representing the value to be queried. The underlying C function returns the current setting of the value, or -1; the Perl POSIX::sysconf wrapper returns undef in place of -1.
If the name parameter represents a property that is valid, but not supported, then $! will be left unchanged even though the call fails. An invalid name causes the call to fail and set $! to EINVAL.

use POSIX qw(sysconf _SC_OPEN_MAX);

my $max_open = sysconf(_SC_OPEN_MAX);

if (!defined $max_open)
{
    print STDERR "Cannot determine _SC_OPEN_MAX: $!\n";
}
else
{
    print "Max open files: $max_open\n";
}


To check which values can be queried using sysconf(), consult the sysconf manual page or the system include files.


POSIX::pathconf

Some limits may have different values on different logical filesystems on the same overall Unix system. These are mostly related to the properties of the filesystem, such as the maximum length of a filename, or a pathname.
To query such a value, use the POSIX::pathconf function. It is similar in its operation to sysconf, except that it requires either a pathname or a file descriptor representing an already open file to be passed in addition to the name of the limit being queried. This indicates on which filesystem the limit is being checked.

use POSIX qw(pathconf _PC_NAME_MAX);

my $max_name_len = pathconf('.', _PC_NAME_MAX);

if (!defined $max_name_len)
{
    print STDERR "Cannot determine _PC_NAME_MAX: $!\n";
}
else
{
    print "Max name length: $max_name_len\n";
}


The return values from pathconf are the same as those from sysconf.
fpathconf has the same functionality as pathconf, except it takes as its first argument a file descriptor instead of a file name. File descriptors will be described later.


Dates and Times

Unix systems count time as a number of seconds from a fixed starting point, called the Epoch, that is defined as midnight (00:00:00) UTC on January 1st, 1970.
This is how time is represented internally in the operating system; applications use a number of functions to translate this into the appropriate representation, taking into account timezones and daylight saving time.
To find the number of seconds since epoch, use the time() built-in function. This can be useful for timing operations to the nearest second.
On some systems it may be possible to use an alternative function, gettimeofday(). This function was originally part of the BSD Unix family, and it is often useful because it provides a higher resolution of the time measured. The C version of gettimeofday() requires an indication of a timezone as the second argument; it is acceptable to use NULL here, signifying the current timezone.

For measuring time in better granularity than one second, you may use either the Time::HiRes module (available from CPAN, and bundled with Perl since 5.8), or if you have gettimeofday(2), you may be able to use the syscall interface of Perl; see the perlfaq8 manpage for details.
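A minimal sketch of sub-second timing with Time::HiRes. The 20 millisecond usleep is only a stand-in for whatever work you actually want to time.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval usleep);

my $start = [gettimeofday];           # [seconds, microseconds]
usleep(20_000);                       # stand-in for the work being timed
my $elapsed = tv_interval($start);    # fractional seconds from $start to now

printf "Elapsed: %.6f seconds\n", $elapsed;
```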


Displaying Dates and Times

Although the central time representation in Unix is a count of seconds, there are many ways in which the date and time can be displayed. A collection of functions allow the programmer to convert the date from the seconds count into various formats, including formatted strings and a structured “broken-down time” representation.

The most commonly used function is localtime(), which in scalar context takes the basic second count and converts it to a string, in the correct timezone and ready for printing. By default the current time is used, but output from time() can be supplied as an argument.

The Time::localtime and Time::gmtime modules offer interfaces to localtime and gmtime where the time elements are named.
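As a sketch of the by-name interface, the standard Time::gmtime module replaces gmtime with a version that returns an object. I use gmtime rather than localtime here only so that the result does not depend on your timezone. The timestamp is an arbitrary choice.

```perl
use strict;
use warnings;
use Time::gmtime;

# 365 days after the Epoch (1970 was not a leap year)
my $t = gmtime(365 * 86_400);

# Fields are fetched by name rather than by list position
printf "%04d-%02d-%02d\n", $t->year + 1900, $t->mon + 1, $t->mday;
```

Time::localtime works the same way for local time.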



“Broken Down Time” Array

Sometimes it is desirable to obtain partial date information, for example the month or year. To do this, it is best to use an alternative representation known as “broken down time”, as returned by localtime in list context.

Index  Name   Purpose                  Range   Notes
0      sec    Seconds after minute     0-61    2 "leap seconds"
1      min    Minutes after hour       0-59
2      hour   Hours                    0-23
3      mday   Day of the month         1-31
4      mon    Month of the year        0-11    July is 6 (not 7)
5      year   Years since 1900                 2003 is 103
6      wday   Days since Sunday        0-6
7      yday   Days since 1 Jan         0-365
8      isdst  In Daylight Saving Time  -1/0/1



The elements in the list allow the programmer to extract any pertinent pieces of date or time information to allow more specialised calculations to be performed. The sec range requires some explanation: normally the upper limit is 59, but up to 2 extra "leap seconds" can sometimes be added.
To translate from the number of seconds since epoch to broken down time, use localtime or gmtime:
@dt = gmtime ( $secs );
@dt = localtime ($secs );
If the argument is omitted the current time is used. The difference is that localtime will return the time correct in the current timezone, whereas gmtime returns the time correct in GMT (basically this is the same as UTC).
Alternatively, the POSIX function strftime allows the programmer to apply printf()-like formatting so that the exact format of the date can be specified when it is printed. Consult the manual page for details of the formatting options - they are considerable…
It is possible to translate from broken down time back to seconds since epoch. To do this use the POSIX function mktime.
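The sketch below exercises the three conversions just described: seconds to broken down time, broken down time to a formatted string with strftime, and broken down time back to seconds with mktime. The format string is just one I find readable.

```perl
use strict;
use warnings;
use POSIX qw(strftime mktime);

# Seconds since epoch to broken down time (in GMT, so timezone-independent)
my @epoch = gmtime(0);
my ($mon, $year, $wday) = @epoch[4, 5, 6];

# Broken down time to a formatted string
my $stamp = strftime('%Y-%m-%d %H:%M:%S', @epoch);
print "Epoch: $stamp (mon $mon, year $year, wday $wday)\n";

# Broken down local time back to seconds since epoch
my $now  = time;
my $back = mktime(localtime($now));
print "Round trip: $now -> $back\n";
```

Remember to add 1900 to year and 1 to mon before showing them to a human.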


Summary

Many standard Unix system APIs can be called from Perl; some are built-in and some come from modules, notably POSIX.

The original APIs were designed to be called from C, so knowledge of that language is useful. If you wish to know more about the C interfaces, come on QA's Unix Programming Course.

I hope to post examples of other Unix APIs, for example IO, Sockets, and System V IPC.