[R6RS] notes on paths

Fri May 26 09:51:42 EDT 2006

Under Unix, a filesystem path is a sequence of non-NUL bytes. A path is
shown to the user by decoding according to the current locale. (The
decoding may fail, in which case "?" is usually used in place of a byte
that is not part of an encoding sequence.) A user-supplied path (in
terms of "characters") is encoded according to the current locale to
produce a filesystem path.
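
As a rough illustration (in Python rather than Scheme, and with the
two encodings below standing in for "the current locale", which is an
assumption of the sketch), the same user-supplied characters yield
different filesystem paths under different locales, and a path created
under one locale may not decode under another. Python substitutes
U+FFFD where the text above says "?".

```python
# Characters typed by the user: "résumé.txt".
text = "r\u00e9sum\u00e9.txt"

# Encoding under two hypothetical locales gives two different byte paths.
latin1_path = text.encode("latin-1")
utf8_path = text.encode("utf-8")

# Showing the Latin-1 path to a user whose locale is UTF-8: the 0xE9 bytes
# are not part of any valid UTF-8 sequence, so each one is replaced.
shown = latin1_path.decode("utf-8", errors="replace")

# Re-encoding what was shown does not give back the original bytes.
lost = shown.encode("utf-8") != latin1_path
```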

Under Windows, a filesystem path is a sequence of non-NUL 16-bit
values. A path is shown to the user by UTF-16 decoding (which may fail
due to unpaired surrogates), and a user-supplied path is UTF-16 encoded
to produce a filesystem path.
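
A small sketch of the unpaired-surrogate case, again in Python and only
as an illustration (the three 16-bit values are made up for the
example): strict UTF-16 decoding rejects the sequence, while Python's
"surrogatepass" error handler, used here as a stand-in for a lossless
representation, preserves it.

```python
import struct

# Three 16-bit values: "a", an unpaired high surrogate, "b".
units = (0x0061, 0xD800, 0x0062)
raw_path = struct.pack("<3H", *units)  # little-endian 16-bit units

def try_strict_decode(raw: bytes):
    """Return the decoded text, or None if strict UTF-16 decoding fails."""
    try:
        return raw.decode("utf-16-le")
    except UnicodeDecodeError:
        return None

# Strict decoding fails because 0xD800 has no matching low surrogate.
strict = try_strict_decode(raw_path)

# Passing surrogates through keeps every 16-bit value, so the original
# sequence can be recovered exactly.
lenient = raw_path.decode("utf-16-le", errors="surrogatepass")
restored = lenient.encode("utf-16-le", errors="surrogatepass")
```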

Strings:

Any sequence of bytes or 16-bit values could be encoded into an R6RS
string, and in that way, strings could be used to encode filesystem
paths. However, a string surely should be interpreted as a
user-supplied path --- since strings will be used for literal paths,
etc. --- so a string should be locale-encoded or UTF-16-encoded to
produce a filesystem path.

Since the string encoding of a filesystem path may lose information,
library functions that produce paths (such as a function that returns
the content of a directory) shouldn't produce string encodings. Also, a
locale-independent representation is useful for programs that retain
paths across locale changes.

 Note: Although Java has a class to represent pathnames, the internal
 representation of a pathname is just a string. Java strings are 16-bit
 sequences, just like Windows filesystem paths, so it works fine there.
 But under Unix, with a typical locale, the directory-list library
 method can produce pathnames that do not actually exist, because
 decoding inserts "?" in place of undecodable parts of the filesystem
 path. Also, the resulting pathname is locale-specific.
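
To make the information loss concrete, here is a hedged sketch in
Python (not Java or Scheme; the helper name and the UTF-8 locale are
assumptions of the example). Two distinct filesystem paths collapse to
the same string, and that string re-encodes to a path that is not in
the directory at all.

```python
def to_display_string(raw: bytes, locale_encoding: str = "utf-8") -> str:
    """Hypothetical helper: decode a byte path lossily, as a string-only API would."""
    return raw.decode(locale_encoding, errors="replace")

# Pretend directory listing: two different files, neither name valid UTF-8.
listing = [b"r\xe9sum\xe9.tex", b"r\xe8sum\xe8.tex"]

# Both entries decode to the same string, so the strings cannot tell
# the two files apart.
as_strings = [to_display_string(p) for p in listing]

# Encoding a listed string back yields a byte path that matches neither file.
re_encoded = as_strings[0].encode("utf-8")
```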

Byte vectors:

Of course, any sequence of bytes or 16-bit values could be encoded into
an R6RS byte vector. Unlike the case for strings, there's much less
reason to make the byte-vector encoding have anything to do with the
locale encoding, so a lossless encoding can be used.

A drawback of having library functions generate byte vectors, though,
is that the byte vectors are not human-readable (unless byte vectors
are by default printed using the current locale's encoding, which seems
like a bad idea for byte vectors in general).

A separate datatype:

By creating a separate datatype for paths, filesystem paths can be
reliably represented without loss of information, and the default
display mechanism for paths can be human-readable (e.g., decode the
bytes according to the current locale before displaying).
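
A minimal sketch of the idea, in Python and under stated assumptions:
the class name `FsPath`, its methods, and the fixed UTF-8 "locale" are
all hypothetical, and this is not how PLT Scheme implements its path
type. The value keeps the raw bytes, compares by them, and only
decodes when it is displayed.

```python
from dataclasses import dataclass

LOCALE_ENCODING = "utf-8"  # assumed stand-in for the current locale

@dataclass(frozen=True)
class FsPath:
    """Hypothetical path value: stores the exact filesystem bytes."""
    raw: bytes

    @classmethod
    def from_string(cls, text: str) -> "FsPath":
        # A user-supplied string is encoded according to the locale.
        return cls(text.encode(LOCALE_ENCODING))

    def __str__(self) -> str:
        # Display is lossy by design; the stored bytes stay untouched.
        return self.raw.decode(LOCALE_ENCODING, errors="replace")

p = FsPath(b"caf\xe9.txt")       # as a directory listing might return it
q = FsPath.from_string("a.txt")  # as a literal path in a program
```

Here `str(p)` is readable even though the bytes are not valid UTF-8,
and `p.raw` still names the actual file.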

The above is the chain of reasoning that led to the current design in
PLT Scheme. The conversion to a separate path datatype was painful, but
it's mostly fine now that we've converted. One remaining bit of pain is
writing libraries that accept both strings and path values for
pathnames; this is no problem when a string is needed for a pathname,
but it means that `path->string' conversions are frequently protected
by a `path?' test.
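
The guard looks roughly like the following, sketched in Python with
bytes standing in for path values (the function name and the UTF-8
default are assumptions of the example, not part of any library):

```python
from typing import Union

def pathname_to_text(pathname: Union[str, bytes], encoding: str = "utf-8") -> str:
    """Hypothetical helper: accept either representation, return text."""
    # Analogue of guarding a path-to-string conversion with a type test:
    # only path values (bytes here) are converted; strings pass through.
    if isinstance(pathname, bytes):
        return pathname.decode(encoding, errors="replace")
    return pathname

from_path = pathname_to_text(b"notes.txt")
from_string = pathname_to_text("notes.txt")
```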

In some cases, a programmer wants to manipulate part of a path without
losing information in the rest of the path, such as replacing a ".tex"
suffix with ".dvi". Converting to a string and back might lose
information in the path before the suffix. To handle these cases, we
provide `path->bytes', and we define the byte-vector encoding in such a
way that, for sensible locales, consecutive ASCII characters in the
locale decoding map to consecutive ASCII values in the byte vector.
That's good enough in practice, because the path manipulations involve
just ASCII characters. Suffix replacement, for example, works on a byte
vector, and it converts the result back with `bytes->path'.
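
A hedged sketch of the suffix case in Python (the function name is
made up, and raw bytes stand in for the result of `path->bytes'):
working on the bytes changes only the ASCII suffix, whereas a trip
through a string damages the undecodable part before it.

```python
def replace_suffix(raw: bytes, old: bytes, new: bytes) -> bytes:
    """Hypothetical helper: swap an ASCII suffix without touching the rest."""
    if not raw.endswith(old):
        raise ValueError("path does not end with the expected suffix")
    return raw[: len(raw) - len(old)] + new

# 0xE8 is not valid UTF-8 here, so this path cannot survive decoding.
original = b"th\xe8se.tex"

# Byte-level replacement keeps the 0xE8 byte exactly as it was.
via_bytes = replace_suffix(original, b".tex", b".dvi")

# String-level replacement under an assumed UTF-8 locale loses it.
via_string = (
    original.decode("utf-8", errors="replace")
    .replace(".tex", ".dvi")
    .encode("utf-8")
)
```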

Matthew