[R6RS] I/O

Tue Jul 11 12:33:06 EDT 2006

William D Clinger <will at ccs.neu.edu> writes:

> Mike wrote:
>> More confusion, I'm afraid.  The descriptor is not for communication
>> between a reader and a writer.  It is for communicating between the
>> constructor of a writer and a procedure that accesses some aspect of
>> the writer.  There's `open-bytes-writer' and `writer-bytes', which
>> communicate.
>
> Please recall that we were discussing the specification
> of make-simple-reader.  When I questioned the rationale
> for the descriptor, you responded by talking about bytes
> writers.  Now you say writers have no need to communicate
> with readers.  Therefore, I conclude, the descriptor is
> not motivated by the example you gave for that purpose.
> I remain mystified by your argument.

And I remain mystified by yours.  You wrote:

> Specification of make-simple-writer: See my comments above on
> make-simple-reader.

So I chose an example about writers to make my point.  Are you saying
that you don't understand how the descriptor makes the byte-writer
example work, or that descriptors are OK for writers, but not for
readers?

In general, I don't follow your argument: 

> The document does not explain how a programmer is supposed to lay
> hands on an object that can legitimately be passed as the second
> argument (the descriptor) to this procedure.

Neither does the specification for `cons'.  Instead, a contract has to
exist between the code that creates a pair, and the code that extracts
its components.  It's the same with the descriptor.
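
To make the contract concrete, here is a minimal sketch in Python
rather than Scheme.  The names mirror the ones in this thread, but the
signatures are my assumption for illustration and are not taken from
the draft: the constructor stores a descriptor that only the matching
accessor knows how to interpret.

```python
class Writer:
    """A writer bundles a put procedure with an opaque descriptor."""

    def __init__(self, put, descriptor):
        self.put = put
        self.descriptor = descriptor


def make_simple_writer(put, descriptor):
    # Hypothetical analogue of make-simple-writer: the descriptor is
    # carried along untouched; this procedure never looks inside it.
    return Writer(put, descriptor)


def open_bytes_writer():
    # The constructor decides what the descriptor is: here, the
    # bytearray that accumulates everything written.
    buffer = bytearray()

    def put(data):
        buffer.extend(data)
        return len(data)

    return make_simple_writer(put, buffer)


def writer_bytes(writer):
    # The accessor relies on the contract with open_bytes_writer:
    # it knows the descriptor is the accumulating bytearray.
    return bytes(writer.descriptor)


w = open_bytes_writer()
w.put(b"hello")
w.put(b" world")
result = writer_bytes(w)
```

Nothing outside `open_bytes_writer' and `writer_bytes' needs to know
what the descriptor is, just as nothing outside the code that builds
and takes apart a pair needs to know what is in it.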

> I don't have any problem with the trailing infinite
> sequence of ends-of-file, but with the idea that data
> must be read following the reading of an end of file.
> I do not agree that this is a natural by-product of
> incremental or interactive data sources.

There are certainly other ways to encode it, but they all come down to
returning a special value.  Moreover, some platforms (like Unix,
AFAICS) don't allow you to distinguish between a "terminal" end of
file and a "temporary" one.  What would you have, say, `peek-byte'
return when no data is available, but terminal end of file has not
been reached?
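
A small sketch of the situation I have in mind, in Python for brevity.
The class and the EOF marker are hypothetical stand-ins, not part of
any proposal: an interactive source reports end of file when nothing
is available right now, and delivers data again once more arrives.

```python
from collections import deque

EOF = object()  # stand-in for the end-of-file object


class InteractiveSource:
    """A source that may run dry temporarily, like a terminal."""

    def __init__(self):
        self._pending = deque()

    def feed(self, data):
        # Simulates more input arriving from the outside world.
        self._pending.extend(data)

    def peek_byte(self):
        # With nothing pending, the only thing to return is EOF.
        return self._pending[0] if self._pending else EOF

    def read_byte(self):
        return self._pending.popleft() if self._pending else EOF


src = InteractiveSource()
src.feed(b"a")
first = src.read_byte()      # the byte for "a"
at_eof = src.read_byte()     # EOF: no data available at the moment
src.feed(b"b")
after_eof = src.read_byte()  # data again, after an EOF was read
```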

> There are, in fact, several standard ways.  Why not
> support them?

I have no objection to supporting them, but I object to restricting
textual I/O to the supported standards.

> The side effect you are proposing will encourage Scheme programmers
> to create yet more nonstandard, ad hoc approaches to this mostly
> solved problem.

Judging from all the work going into XEmacs on figuring out encodings,
and the products sold by companies like Basis Tech, this is very 
far from being a solved problem.

> C12a When a process interprets a code unit sequence which
> purports to be in a Unicode character encoding form, it
> shall treat ill-formed code unit sequences as an error
> condition, and shall not interpret such sequences as
> characters.

The question is how to represent that error.  Raising an exception
creates all kinds of hairy protocol issues: how to detect where the
error occurred, how to let the program skip over it, how to return the
data before it, and so on.  So I believe representing an encoding
error by a char value is preferable, even if it is inconsistent with a
literal interpretation of the Unicode standards.  Maybe this ought to
be configurable upon opening a file, similar to what Python does:

http://www.jorendorff.com/articles/unicode/python.html

(Note that Python also uses "usually a question mark" in "replace"
mode.)

Gillam's book says that the convention for representing encoding
errors is U+FFFD (REPLACEMENT CHARACTER) (page 540).

Java also allows configuring the behavior.

So how about adding an additional argument (after the transcoder),
which may be one of 'ignore, 'raise, or 'replace, for the three
functions?
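
For reference, this is roughly how the three behaviors look with
Python's codecs.  Python calls the raising mode "strict"; the mapping
onto the proposed Scheme symbols is my analogy only.

```python
data = b"ab\xffcd"  # 0xFF never occurs in well-formed UTF-8

# Like 'ignore: the ill-formed byte is silently dropped.
ignored = data.decode("utf-8", errors="ignore")

# Like 'replace: the ill-formed byte becomes U+FFFD.
replaced = data.decode("utf-8", errors="replace")

# Like 'raise: decoding stops with an exception that records
# where the error occurred.
try:
    data.decode("utf-8", errors="strict")
    raised = False
except UnicodeDecodeError as e:
    raised = True
    error_offset = e.start

# When encoding, "replace" substitutes a question mark instead.
encoded = "caf\u00e9".encode("ascii", errors="replace")
```

Note that the exception only carries an offset; recovering the data
before the error and resuming after it is still left to the caller.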

> I see several inefficiencies.  At the very outset, the
> representation of every output port will have to contain
> space that is adequate for every buffer mode.  Furthermore
> each output operation will have to check the current mode,
> even within loops that contain infrequent predicated calls
> to unknown procedures.  Et cetera.  For some loops, we're
> probably talking about a factor of 2 in performance.

If you output byte-by-byte, I can see this making a difference.
However, if you want to do fast I/O, you had better do block I/O,
where I suspect this will be lost in the other overhead.  Generally,
most other I/O APIs have accepted whatever overhead this incurs,
including C's.
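
A toy illustration of why I expect the mode check to disappear with
block I/O.  This is Python with a made-up port class, not a benchmark
and not a proposed implementation: it only counts how often the buffer
mode gets checked.

```python
class OutputPort:
    """Toy output port that checks its buffer mode on every write."""

    def __init__(self, mode="block"):
        self.mode = mode
        self.mode_checks = 0
        self.buffer = bytearray()
        self.flushed = bytearray()

    def _after_write(self):
        # The per-operation check under discussion.
        self.mode_checks += 1
        if self.mode == "none":
            self.flush()

    def put_byte(self, b):
        self.buffer.append(b)
        self._after_write()

    def put_bytes(self, data):
        self.buffer.extend(data)
        self._after_write()

    def flush(self):
        self.flushed.extend(self.buffer)
        self.buffer.clear()


payload = bytes(range(256)) * 16  # 4096 bytes

bytewise = OutputPort()
for b in payload:
    bytewise.put_byte(b)   # one mode check per byte
bytewise.flush()

blockwise = OutputPort()
blockwise.put_bytes(payload)  # one mode check for the whole block
blockwise.flush()
```

Both ports end up with the same output; the byte-wise loop pays for
the check 4096 times, the block write once.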

-- 
Cheers =8-} Mike
Friede, Völkerverständigung und überhaupt blabla