Formal comment #88 (defect)

Improve port i/o
Reported by:	William D Clinger

Component:	i/o
Version:	5.91

Section 15.3 of the draft R6RS describes a design for port i/o that
was based on my misunderstanding of the requirements. In particular,
it was designed to allow arbitrary mixing of binary and textual i/o
for a small set of Unicode character encodings, but does not
generalize well to the large set of encodings that are currently in
use.

The real requirements appear to be: 

- Support efficient binary i/o. 

- Support efficient text i/o. 

- Provide a small set of standard transcoders, while allowing
  implementations to provide others, including transcoders with
  arbitrarily weird semantics.

- Support conversion of binary ports into text ports, mainly to
  support use cases such as input from XML files, where the transcoder
  is determined by reading a small prefix of the file.

The first three of those requirements can be satisfied by a more
conventional design.

The fourth requirement can be met by a procedure that accepts a binary
port as argument and returns a text port that consumes bytes from the
binary port while transcoding them into characters.

The rest of this comment suggests a better design, and then describes
some outstanding issues for which I have no strong recommendation at
this time.

* * 

The main ideas of this alternative design are to distinguish binary
from text files, and to forbid compositions of
transcoders. Composition of transcoders is well-defined in a
mathematical sense, but the composition of two transcoders is unlikely
to be useful.

Those ideas run counter to the ideas of SRFI 81, which was a starting
point for section 15.3 of the draft report.

Other aspects of the suggested design include: 

A transcoder is an immutable description (think of it as a factory
method for manufacturing transcoding objects) of some possibly
stateful algorithm for translating sequences of bytes into sequences
of characters and vice versa.

Every transcoder can operate in the input direction (bytes to
characters) or in the output direction (characters to bytes), but the
composition of those directions need not be identity (and often
isn't). (See [issue:bidirectional].)

Transcoders are never composed, so there is no reason to define the
composition of two transcoders.

The standard transcoders are constructed from codecs, eol styles, and
handling modes as described in section 15.3 of the draft R6RS.

The standard codecs of Scheme include: 

latin-1-codec utf-8-codec utf-16-codec utf-32-codec 

That list of standard codecs includes three of the seven Unicode
character encoding schemes, but omits UTF-16BE, UTF-16LE, UTF-32BE,
and UTF-32LE on the grounds that Scheme programmers should be
encouraged to use codecs that use and interpret a byte-order-mark
(BOM) or its absence as specified by the Unicode standard. (See
[issue:BOM].)

Implementations may support other codecs, eol styles, and other kinds
of transcoders. In particular, they may support Unicode character
encoding schemes that interpret a BOM as a ZERO WIDTH NO-BREAK SPACE,
a noncharacter, or as a private use character.
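As a rough illustration of how a transcoder is assembled from a codec,
an eol style, and a handling mode, here is a minimal sketch. It assumes
the procedure names of the (rnrs) library as eventually published,
which may not match the 5.91 draft name for name; the particular
combination chosen (UTF-8, lf, replace) is only an example.

```scheme
(import (rnrs))

;; A transcoder built from the three parts named in the text:
;; a codec, an eol style, and an error-handling mode.
(define tx
  (make-transcoder (utf-8-codec)
                   (eol-style lf)
                   (error-handling-mode replace)))

;; The accessors recover the parts. Eol styles and handling modes
;; are represented as symbols.
(define tx-eol  (transcoder-eol-style tx))
(define tx-mode (transcoder-error-handling-mode tx))

;; The byte #xFF never occurs in well-formed UTF-8, so with the
;; replace handling mode the decoder substitutes a replacement
;; character instead of raising an exception.
(define decoded
  (bytevector->string (u8-list->bytevector '(104 105 255)) tx))
```

The transcoder itself carries no decoding state; each use of it, as in
the bytevector->string call above, starts a fresh translation.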

The binary transcoder is a special pseudo-transcoder that is returned
by the binary-transcoder procedure (which would be added to the
procedures described in section 15.3). Every binary transcoder is eqv? 
to every binary transcoder (but not necessarily eq?), and is not eqv? 
to any transcoder that is returned by the make-transcoder
procedure. The transcoder-codec, transcoder-eol-style, and
transcoder-error-handling-mode procedures return #f when given a
binary transcoder as their argument.

A binary port is a port whose transcoder is the binary transcoder.

Binary ports are created by passing the binary transcoder to an open-X
procedure, or by calling an open-bytes-X or call-with-bytes-X
procedure with no transcoder argument.

The binary lookahead-X, get-X, and put-X operations (which have "byte"
or "bytes" in their names) operate only on binary ports.

A text port is a port whose transcoder is not the binary transcoder.

Text ports are created by passing a transcoder other than the binary
transcoder to an open-X procedure, or by calling an open-X procedure
without a transcoder argument (provided the open-X procedure is not
one of those whose standard name contains "bytes").

The textual lookahead-X, get-X, and put-X operations operate only on
text ports. They do not accept a transcoder as an argument.

A new procedure, transcoded-port, takes a binary port and a transcoder
as arguments and returns a new text port whose state is largely that
of the binary port but whose transcoder is the newly specified
transcoder.

To prevent interference between operations on the original binary port
and buffering of transcoded characters on the text port created by
transcoded-port, the original binary port is closed when the derived
text port is created. (Implementation note: the original binary port
can be cloned, the cloned port encapsulated within the derived text
port, and then the original port closed in a special way that doesn't
release resources needed by its clone.)
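The XML use case and the closing rule can be sketched as follows. This
is only an illustration under stated assumptions: it uses the (rnrs)
names open-bytevector-input-port, lookahead-u8, and transcoded-port
rather than the draft's "bytes" names, and sniff-transcoder and
open-sniffed-text-port are hypothetical helpers that inspect a single
byte, which is far less than real XML encoding detection requires.

```scheme
(import (rnrs))

;; Hypothetical helper: choose a transcoder by peeking at the first
;; byte of a binary port without consuming it. A first byte of #xFE
;; or #xFF is taken as the start of a UTF-16 byte order mark;
;; anything else is treated as UTF-8.
(define (sniff-transcoder binary-port)
  (let ((b (lookahead-u8 binary-port)))
    (if (and (not (eof-object? b)) (memv b '(#xFE #xFF)))
        (make-transcoder (utf-16-codec))
        (make-transcoder (utf-8-codec)))))

;; Hypothetical helper: open a binary port on a bytevector, pick a
;; transcoder from its prefix, and convert it into a text port.
(define (open-sniffed-text-port bv)
  (let* ((bin (open-bytevector-input-port bv))
         (tx  (sniff-transcoder bin)))
    ;; After this call, bin must not be used directly any more;
    ;; all further reads go through the derived text port.
    (transcoded-port bin tx)))

(define xml-text
  (get-string-all
   (open-sniffed-text-port
    (string->utf8 "<?xml version=\"1.0\"?><a/>"))))
```

Because the prefix is examined with a lookahead operation, no bytes
are consumed before the text port takes over, so the text port sees
the document from its first character.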

If no optional transcoder argument is passed to an open-file-X
procedure, then a text port is returned but the transcoder associated
with that text port is not otherwise specified. (See [issue:locale].)

The port-position and set-port-position! procedures are required only
for binary ports that were created by an open-X procedure. (See
[issue:position].)

The open-X procedures may raise an exception if the specified
transcoder is not supported for the kind of port being opened.

To simplify the process of reading individual characters from a binary
port, the R6RS should provide something like get-char-from-binary and
lookahead-char-from-binary, which would take a binary port and a
transcoder as arguments. (See [issue:lookahead].)

The various procedures that are associated with bytes and string ports
would also change. The changes for those procedures are contingent
upon acceptance of the design sketched above, so I will not try to
suggest any detailed specification for those procedures in this
comment, except to note that transcode-bytes and transcode-string
procedures should be provided to simplify translations from bytes to
strings and vice versa.
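To make the idea of transcode-bytes and transcode-string concrete,
here is a small sketch. It assumes the (rnrs) procedures
string->bytevector and bytevector->string as stand-ins for the two
procedures suggested above; round-trip is a hypothetical helper. The
sketch also shows why the composition of a transcoder's two directions
need not be the identity.

```scheme
(import (rnrs))

(define latin-1
  (make-transcoder (latin-1-codec)
                   (eol-style none)
                   (error-handling-mode replace)))

;; Hypothetical helper: characters to bytes (output direction),
;; then bytes back to characters (input direction).
(define (round-trip s tx)
  (bytevector->string (string->bytevector s tx) tx))

;; U+00E9 is representable in Latin-1, so this string survives.
(define ok (round-trip "caf\xE9;" latin-1))

;; U+03BB is not representable in Latin-1. With the replace handling
;; mode the encoder substitutes another character, so the string
;; that comes back differs from the one that went in.
(define lossy (round-trip "\x3BB;x" latin-1))
```

With a raise handling mode in place of replace, the second call would
be expected to signal an encoding error rather than return a changed
string.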

* * 

Issues: 

[issue:bidirectional] 

Transcoding algorithms are unidirectional (bytes to characters or
characters to bytes), but are usually named in pairs that are
near-inverses of each other.

[issue:BOM] 

While I'm all for encouraging programmers to use the Unicode character
encodings that interpret byte order marks as specified by the Unicode
standard, I worry about documents that implicitly use or explicitly
specify UTF-16LE or UTF-32LE, which cannot be read using the UTF-16 or
UTF-32 codecs. If few documents actually use UTF-16LE or UTF-32LE,
then this is not much of a concern.

[issue:locale] 

Implementations of Scheme will be in a much better position than the
R6RS to guess the transcoding that is appropriate for a text file, so
the R6RS should not insist upon any particular transcoding when none
is specified by the call to an open-X procedure.

[issue:position] 

Asking for the byte position of a complexly transcoded port can be
like asking for the carrier frequency of a spread spectrum signal, and
I am told that some standard encodings do not always align the
encodings of characters upon byte boundaries, so the port-position
operation should be required only for binary ports, if at all.

[issue:lookahead] 

For a more general approach to this problem, see
http://lists.r6rs.org/pipermail/r6rs-discuss/2006-November/000646.html
(See also [issue:readers].)

[issue:readers] 

The readers described in section 15.2 of the draft R6RS might seem
relevant to the problem of providing ports with arbitrary lookahead,
but they can't solve that problem because they aren't ports. It seems
as though the right thing to do may be to eliminate readers and
writers from the report, while folding their functions into ports that
represent arbitrary sources and sinks. That might be too radical for
R6RS, but dropping readers and writers from the R6RS would clear the
way for a more general solution in R7RS.

RESPONSE:

We will revise the report draft along the lines suggested in this comment,
without get-char-from-binary and lookahead-char-from-binary, which were
identified during the discussion of the comment as both unnecessary
and inconsistent with the theme of the comment.