[R6RS] Changing the transcoding mid-stream

William D Clinger will at ccs.neu.edu
Sat Aug 19 06:45:14 EDT 2006


Mike wrote:
> >> I think we're using different definitions of the word "binary I/O".
> >> You seem to mean "untranscoded I/O" whereas I mean "I/O to and from
> >> bytes objects and octets."  Is that a correct interpretation of what
> >> you're asking for?
> >
> > I don't want to answer in the affirmative, because
> > I no longer have any confidence in my understanding
> > of what you mean by transcoding.  You are using that
> > term to include both "compression or SSL or whatever"
> > and Unicode encoding schemes, which to me are radically
> > different things.
> 
> They are not to me.  The SRFI spells out its notion of transcoding
> under "Encoding" in the "Design rationale" section.  Specifically:
> 
> >> This SRFI avoids this problem by specifying that textual I/O always
> >> uses UTF-8. This means that, if the target or source of an I/O port
> >> is to use a different encoding, a translated port needs to be used,
> >> for which this SRFI offers the required facilities. This means that
> >> text decoders or encoders are expressed as binary-to-binary
> >> mappings, and as such compose.

The word "textual" had suggested to me that read-u8 and
friends do not use UTF-8.  I am very sorry to have discovered
my error at this late date.

> Moreover, the second sentence in the section on "Text Transcoders" is:
> 
> >> A transcoder is an opaque object encapsulating a specific
> >> translation from byte sequences to byte sequences.

Note, however, that the section is called "Text Transcoders",
and the sentence preceding the one you quoted begins with the
words "Text transcoders".  The second paragraph talks about
"a text encoder/decoder", saying "these transcoders all
represent specific text encodings".  That paragraph expands
upon the end-of-line convention issue, and then refers
specifically to "read-char and the various read-string...
procedures" and to "write-char and the various write-string...
procedures" when talking about decoding and encoding the
end-of-line convention, as though read-u8 and other procedures
did not participate in that encoding and decoding.

The third paragraph of that section requires transcoders to
do something weird if they encounter an illegal encoding.
That implies that all of the transcoders, including the UTF-8
transcoder, will interfere with binary i/o.  Since the SRFI
also says that "no codec" corresponds to UTF-8, it follows
that the proposal is useless for what I mean by binary i/o.

> I don't think there's any "most programmers" notion about this as most
> programmers (including myself when I started out designing this)
> haven't really considered the implications of mixed binary and textual
> I/O in a multi-encoding and multi-byte encoding setting.

I don't want to argue with you about what most programmers
believe or have considered.  What concerns me is whether
the proposal can deal with what I personally consider to
be binary and mixed binary/textual i/o.  Since you do not
like my definition of those things, let's just consider
the question of whether this proposal can read and write
WAVE files.

As I noted above, it appears to be flat-out impossible to
read or to write a WAVE file using this proposed i/o system.
The file will undoubtedly contain sequences of bytes that
are illegal in UTF-8 (i.e. what the proposal means by "no
codec") and are illegal in all of the other codecs that are
described by the proposal.  At the very least, we need a
true ("binary") no-codec that omits the checking for illegal
byte sequences required of the UTF-8 codec.
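
To make the point concrete, here is a minimal sketch in Python rather
than Scheme.  It assumes a tiny 16-bit mono PCM file with made-up
sample values, and the helper names make_tiny_wave and is_valid_utf8
are hypothetical.  It only illustrates that ordinary WAVE content is
rejected by a strict UTF-8 decoder; it says nothing about how any
particular Scheme implementation behaves.

```python
import struct

def make_tiny_wave(samples):
    """Build a minimal 16-bit mono PCM WAVE byte string (illustrative only)."""
    data = struct.pack("<%dh" % len(samples), *samples)
    # fmt chunk: PCM (1), 1 channel, 8000 Hz, 16000 bytes/s, block align 2, 16 bits
    fmt = struct.pack("<HHIIHH", 1, 1, 8000, 16000, 2, 16)
    body = b"WAVE" + b"fmt " + struct.pack("<I", len(fmt)) + fmt
    body += b"data" + struct.pack("<I", len(data)) + data
    return b"RIFF" + struct.pack("<I", len(body)) + body

def is_valid_utf8(raw):
    """Return True if raw decodes under strict UTF-8, False otherwise."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# The sample value -1 is stored as the bytes FF FF, and FF never occurs in UTF-8.
wave_bytes = make_tiny_wave([-1, -32768, 32767, 0])
print(is_valid_utf8(wave_bytes))
```

The four-byte tags such as "RIFF" are plain ASCII and decode without
trouble; it is the binary fields and the sample data that a checking
UTF-8 codec would reject.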

Suppose we had such a binary codec.  While that would make
it possible to read and to write a WAVE file, it would still
be pretty inconvenient.  You couldn't use read-char to read
the text fields because read-char assumes UTF-8.  You would
have to use read-bytes-n (or similar) for the text fields,
and then translate the bytes yourself.
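
The kind of hand translation described above can be sketched as
follows, again in Python and not as a proposal for the Scheme API.
The function name list_chunks and the sample bytes are assumptions
made up for the illustration.  Every text field is read as raw bytes
and decoded explicitly, while the size fields are unpacked as binary
integers from the same stream.

```python
import io
import struct

def list_chunks(stream):
    """Return (riff_id, form_type, [(chunk_id, size), ...]) from a RIFF stream."""
    riff_id = stream.read(4).decode("ascii")            # text field, decoded by hand
    (riff_size,) = struct.unpack("<I", stream.read(4))  # binary field (unused here)
    form_type = stream.read(4).decode("ascii")          # text field again
    chunks = []
    while True:
        header = stream.read(8)
        if len(header) < 8:
            break
        chunk_id = header[:4].decode("ascii")
        (size,) = struct.unpack("<I", header[4:8])
        chunks.append((chunk_id, size))
        # Skip the chunk body; RIFF pads odd-sized chunks to an even length.
        stream.seek(size + (size & 1), io.SEEK_CUR)
    return riff_id, form_type, chunks

# Made-up sample: a zeroed 16-byte fmt chunk and four bytes of sample data.
sample = (b"RIFF" + struct.pack("<I", 40) + b"WAVE"
          + b"fmt " + struct.pack("<I", 16) + bytes(16)
          + b"data" + struct.pack("<I", 4) + b"\xff\xff\x00\x80")
result = list_chunks(io.BytesIO(sample))
print(result)
```

The point of the sketch is that text and binary fields interleave at
four-byte granularity, so the reader has to switch between the two
interpretations on the same underlying byte stream.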

As things stand now, the R6RS does not appear to offer any
facilities that might help with that translation.  In
particular, we have no operations for translating bytes
objects (or subsequences of bytes objects) into strings.
(We have open-bytes-reader, but there is no way to
specify a translation/transcoding for it.)

In short, it appears that the i/o proposal's approach to
encoding leads to serious complications, especially when
dealing with binary data or binary data mixed with textual
data.

> > I think we should either change the proposal so it can
> > support what most programmers mean by mixed binary and
> > textual i/o, or we should change the proposal to support
> > completely separate binary and textual i/o through
> > completely separate sets of i/o procedures, or we should
> > give up on binary i/o for R6RS and eliminate all of the
> > operations that give the misleading impression of
> > performing (what most programmers mean by) binary i/o.
> 
> I don't.

I request that our next conference call devote a large
chunk of time to debating what to do about R6RS i/o.

Will


