[R6RS] I/O

Tue Jul 11 15:09:38 EDT 2006

Mike wrote:
> And I remain mystified by yours.  You wrote:
> 
> > Specification of make-simple-writer: See my comments above on
> > make-simple-reader.
> 
> So I chose an example about writers to make my point.  Are you saying
> that you don't understand how the descriptor makes the byte-writer
> example work, or that descriptors are OK for writers, but not for
> readers?

I don't understand how the descriptor makes the byte-writer
example work.  Here is what you wrote:

> ....For example, an implementation of a bytes writer
> (which is built-in, but it wouldn't have to be) will need to provide
> `writer-bytes', given just a writer.  Thus, bytes writers keep the
> data that's being accumulated in the descriptor; the descriptor is a
> communication channel between `open-bytes-writer' and `writer-bytes.'

Since this example is supposed to motivate the descriptor
argument of make-simple-writer, I assume you meant to say
that open-bytes-writer could create the writer it returns
by calling make-simple-writer.  (If you were not saying
that, then your example was irrelevant to my point.)  Thus
the write!, get-position, set-position!, end-position, and
close procedures that the open-bytes-writer procedure
passes to the make-simple-writer procedure must close over
the descriptor if they are to access it at all.

You appeared to admit that this is a problem when you wrote:

> The problem is that this state is hidden in closures, and more
> difficult to make available to auxiliary operations such as
> `writer-bytes'.

Now do you understand why I am mystified by your argument?
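
To make the disagreement concrete, here is a rough sketch,
written as a Python model rather than Scheme.  Every name in it
is hypothetical and only mirrors the procedures discussed above,
and it assumes that the write and close procedures are not
handed the descriptor when they are called.  Under that
assumption the accumulated data must be reachable from the
closure anyway; passing it as the descriptor as well only
serves to let writer_bytes find it.

```python
# Python model of the Scheme procedures under discussion.
# All names are hypothetical stand-ins, not the draft's API.

class SimpleWriter:
    """Stand-in for the object returned by make-simple-writer."""
    def __init__(self, ident, descriptor, write, close):
        self.ident = ident
        self.descriptor = descriptor
        self.write = write
        self.close = close

def make_simple_writer(ident, descriptor, write, close):
    return SimpleWriter(ident, descriptor, write, close)

def writer_descriptor(writer):
    return writer.descriptor

def open_bytes_writer():
    buffer = bytearray()      # the data being accumulated

    def write(data):
        # Assumption: write is not passed the descriptor,
        # so it has to close over the buffer.
        buffer.extend(data)
        return len(data)

    def close():
        pass

    # The same buffer is also passed as the descriptor, so that
    # writer_bytes can reach it given just the writer.
    return make_simple_writer("bytes", buffer, write, close)

def writer_bytes(writer):
    return bytes(writer_descriptor(writer))
```

In this model the closure and the descriptor refer to one and
the same object, which is the point at issue.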

> > The document does not explain how a programmer is supposed to lay
> > hands on an object that can legitimately be passed as the second
> > argument (the descriptor) to this procedure.
> 
> Neither does the specification for `cons'.  Instead, a contract has to
> exist between the code that creates a pair, and the code that extracts
> its components.  It's the same with the descriptor.

The R5RS specification for cons names its arguments "obj1"
and "obj2", thereby implying that any object may be passed
as either argument.  The name given to the second argument
of make-simple-reader and of make-simple-writer is simply
"descriptor", which implies nothing about what kinds of
objects are acceptable as the second argument.

I wrote "It may be that any object whatsoever may be passed
as the descriptor", but you have neither confirmed nor
denied that possibility.  As your specification stands, it
provides absolutely no guidance to a programmer concerning
what kinds of objects are legal as the second argument to
make-simple-reader or make-simple-writer.

> There are certainly other ways to encode it, but they all come down to
> returning a special value.  Moreover, some platforms (like Unix,
> AFAICS) don't allow you to distinguish between a "terminal" end of
> file and a "temporary" one.  What would you have, say, `peek-byte'
> return when no data is available, but terminal end of file has not
> been reached?

I would guess that you are assuming a model in which some
character, e.g. END OF TRANSMISSION, popularly known as
control-D, is interpreted as an end of file when typed.

I would prefer a model of interactive ports that does not
impose any such interpretation upon legitimate Unicode
characters, and delivers an end of file object only when
the interactive port is closed, e.g. by a procedure such
as close-input-port.  In that model, peek-byte would block
while no data is available, and would return an eof-object
only once the interactive port has been closed.
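
A minimal sketch of that model, again in Python and with
hypothetical names (InteractivePort, feed, EOF), might look
like the following.  It assumes incoming bytes arrive on a
queue and leaves open who is entitled to close the port; the
only points it illustrates are that control-D is delivered as
an ordinary byte and that the eof-object appears only after
the port is closed.

```python
import queue

EOF = object()   # stand-in for the eof-object

class InteractivePort:
    """Hypothetical model of an interactive input port."""
    def __init__(self):
        self._incoming = queue.Queue()
        self._peeked = None        # byte (or EOF) already taken off the queue
        self._has_peeked = False

    def feed(self, data):
        # Bytes typed by the user; no byte value is special.
        for b in data:
            self._incoming.put(b)

    def close(self):
        # Closing the port is the only source of end of file.
        self._incoming.put(EOF)

    def peek_byte(self):
        if not self._has_peeked:
            # Blocks until a byte arrives or the port is closed.
            self._peeked = self._incoming.get()
            self._has_peeked = True
        return self._peeked

    def read_byte(self):
        b = self.peek_byte()
        if b is not EOF:
            self._has_peeked = False   # consume the byte; EOF stays sticky
        return b
```

With this model, feeding the bytes of "a", control-D, "b" and
then closing the port yields 0x61, 0x04, 0x62 and only then
the eof-object.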

> > There are, in fact, several standard ways.  Why not
> > support them?
> 
> I have no objection to supporting them, but I object to restricting
> textual I/O to the supported standards.

I would strongly prefer that the R6RS not make any closed
world assumptions.  Without such assumptions, it will be
possible for implementations to add other procedures that
support nonstandard textual i/o of arbitrary weirdness.
In other words, I don't know of anyone who is proposing
to restrict textual i/o to the supported standards.  What
I propose is only that the R6RS itself specify textual i/o
for the standard encodings, and for nothing beyond them.

> > The side effect you are proposing will encourage Scheme programmers
> > to create yet more nonstandard, ad hoc approaches to this mostly
> > solved problem.
> 
> Judging from all the work going into XEmacs on figuring out encodings,
> and the products sold by companies like Basis Tech, this is very
> far from being a solved problem.

You are talking about the problem of guessing the encoding
when there is no other information available.  The R6RS
does not need to solve this problem, or even to address it
except by providing a sufficient set of primitives.  Raw
byte i/o is sufficient.  We should provide raw byte i/o,
and we should provide support for the standard Unicode
encodings.  We should not try to provide anything else.

> > C12a When a process interprets a code unit sequence which
> > purports to be in a Unicode character encoding form, it
> > shall treat ill-formed code unit sequences as an error
> > condition, and shall not interpret such sequences as
> > characters.
> 
> The question is how to represent that error.  Raising an exception
> creates all kinds of hairy protocol issues: How to detect where the
> error occurred, allowing the program to skip over it, returning the
> data before it, and so on.  So I believe representing an encoding
> error by a char value is preferable, even if it is inconsistent with a
> literal interpretation of the Unicode standards.  Maybe this ought to
> be configurable upon opening a file, similar to what Python does:
> 
> http://www.jorendorff.com/articles/unicode/python.html
> 
> (Note that Python also uses "usually a question mark" in "replace"
> mode.)
> 
> Gillam's book says that the convention for representing encoding
> errors is U+FFFD (REPLACEMENT CHARACTER) (page 540).
> 
> Java also allows configuring the behavior.
> 
> So how about adding an additional argument (after the transcoder)
> which may be 'ignore, 'raise, 'replace for the three functions?

In my opinion, the simplest and cleanest solution is to
raise a specific continuable exception.  That would catch
the error condition (and this is truly an error condition,
much more so than some of the other things for which we are
insisting an exception be raised), while allowing programs
to install exception handlers that either ignore the
ill-formed sequence or replace it with whatever character(s)
they like.
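
Python's codec machinery offers a loose analogy, sketched
below.  It is not a continuable exception in the Scheme
sense: codecs.register_error installs a named handler that
receives the decoding error and returns a replacement string
together with the position at which decoding should resume.
The handler names "sketch-replace" and "sketch-ignore" and the
helper decode_strict are made up for this illustration.

```python
import codecs

def replace_with_fffd(exc):
    # Substitute U+FFFD for the ill-formed sequence and resume after it.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    return ("\ufffd", exc.end)

def skip_ill_formed(exc):
    # Drop the ill-formed sequence and resume after it.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    return ("", exc.end)

# Handler names are hypothetical; any unused string would do.
codecs.register_error("sketch-replace", replace_with_fffd)
codecs.register_error("sketch-ignore", skip_ill_formed)

def decode_strict(raw):
    # Default behavior: the error is raised, and the exception
    # records where the ill-formed sequence starts and ends.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return ("error", exc.start, exc.end)

data = b"ab\xffcd"   # 0xFF can never appear in well-formed UTF-8
replaced = data.decode("utf-8", errors="sketch-replace")
ignored = data.decode("utf-8", errors="sketch-ignore")
```

The default stays strict, and the program rather than the i/o
system decides what, if anything, stands in for the ill-formed
bytes.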

If the above solution were not adopted, and we were to
insist upon the replacement semantics, the replacement
character should be U+FFFD (REPLACEMENT CHARACTER).  It
should not be an ordinary Unicode character, and should
certainly not be a common ASCII character such as the
question mark.

> > I see several inefficiencies.  At the very outset, the
> > representation of every output port will have to contain
> > space that is adequate for every buffer mode.  Furthermore
> > each output operation will have to check the current mode,
> > even within loops that contain infrequent predicated calls
> > to unknown procedures.  Et cetera.  For some loops, we're
> > probably talking about a factor of 2 in performance.
> 
> If you output byte-by-byte, I can see this making a difference.

Many programs output byte-by-byte.  At some level, most
programs output byte-by-byte.

> However, if you want to do fast I/O, you better do block I/O, where I
> suspect this will be lost in the other overhead.

The more performant Scheme systems have found ways to make
byte-by-byte i/o quite fast.  Even Larceny has a fast version
in an experimental stage; we have been waiting for the R6RS
i/o system to stabilize before we make this fast version the
default.

In other words, your assumption that byte-by-byte i/o must
be intolerably slow is incorrect.

Will