[R6RS] I/O

Fri Jul 7 14:35:26 EDT 2006

Many thanks for the detailed comments!

A new version of the I/O SRFIs has been checked in.  In addition to
the corrections described below, I've massaged the condition hierarchy
a bit in light of what R6RS will have, and added a `native-eol-style'
procedure to the ports layer.

William D Clinger <will at ccs.neu.edu> writes:

> General comment:  Exposing so much low-level detail
> makes it harder to construct an efficient i/o system.
> To me, this primitive i/o abstraction layer looks
> like an extra layer of pure overhead.

This is a misunderstanding about the primary role of the Primitive I/O
layer.  The Primitive I/O layer is mainly for people implementing
custom data sources (and possibly doing very high-performance I/O,
which is hard with the Ports layer).

Now, an implementor will need to provide `open-reader-input-port' and
`open-writer-output-port'.  However, the proposal does not expose how,
say, ports on files are implemented: These could completely bypass
primitive I/O.  (This was a hot topic on the SRFI mailing list; the
first draft did expose more detail.)

It is possible to design the interface for providing custom data
sources to more closely match the Ports system (and I assume most
Scheme systems include abstractions for this, at least internally),
but this is very difficult to do in a manner that's stable, efficient
and easy to write code to.  Refer to the history of "custom ports" in
PLT Scheme for particularly gruesome examples.

> If a port were defined as a reader or writer plus a
> transcoder, I could see some use to this layer, but
> with the side-effecting semantics for associating
> transcoders with ports, I don't.

I'm not sure I understand the comment: Defining ports this way would
ignore the issue of buffering which is essential to the design of Port
I/O.  I also don't see, even if ports were defined as "reader/writer
plus buffer plus transcoder", how the "side-effecting semantics" would
make this layer less useful.

> Filenames:  Please define "octet" somewhere.

I've replaced this with "byte".  (Both "byte" and "octet" are specified in
the section on bytes objects.)

> Readers and Writers:  "The objects representing I/O
> descriptors are called readers for input and writers
> for output."  That sentence appears to be misleading
> because, if I understand this document correctly,
> the word "descriptor" means something completely
> different (and essentially undefined) throughout
> the rest of the document; it does not mean a reader
> object or a writer object.

Right on.  I've eliminated the use of the word "descriptor" in this
paragraph.

> Readers, (get-position):  "EOFs do not count as
> octets."  Do you envision multiple EOFs?

Yes.

> Specification of make-simple-reader:  The document
> does not explain how a programmer is supposed to
> lay hands on an object that can legitimately be
> passed as the second argument (the descriptor) to
> this procedure.  From that I conclude that this
> procedure has no conceivable use in portable code,
> and does not belong in the R6RS.

I'm obviously failing at describing this clearly, and I need your
help.  Remember that the Primitive I/O layer is for implementors of
custom data sources or sinks.  The descriptor is an optional
communication channel between the operations of a certain kind of
source or sink.  For example, an implementation of a bytes writer
(which is built-in, but it wouldn't have to be) will need to provide
`writer-bytes', given just a writer.  Thus, bytes writers keep the
data that's being accumulated in the descriptor; the descriptor is a
communication channel between `open-bytes-writer' and `writer-bytes'.
The other procedures, which don't know what kind of reader/writer
they are given, never touch the descriptor.  Thus, this has nothing
to do with portability.
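
To make the role of the descriptor concrete, here is a minimal sketch
of the idea in Python rather than Scheme.  It is not the SRFI's API:
the `Writer' class and `write_all' are hypothetical, and
`open_bytes_writer' and `writer_bytes' only mirror the names above.

```python
class Writer:
    """Hypothetical stand-in for a primitive-I/O writer: an opaque
    descriptor plus the closure that performs the writes."""

    def __init__(self, descriptor, write):
        self.descriptor = descriptor  # never inspected by generic code
        self.write = write            # write(data) -> number of bytes accepted


def open_bytes_writer():
    buf = bytearray()                 # accumulated output lives here

    def write(data):
        buf.extend(data)
        return len(data)

    # The buffer doubles as the descriptor, so writer_bytes can find it.
    return Writer(buf, write)


def writer_bytes(writer):
    # Only this procedure knows that the descriptor is a bytearray.
    return bytes(writer.descriptor)


def write_all(writer, data):
    # Generic code: works for any writer and ignores the descriptor.
    return writer.write(data)


w = open_bytes_writer()
write_all(w, b"hello, ")
write_all(w, b"world")
result = writer_bytes(w)
```

Without the descriptor slot, `buf' would be reachable only from inside
the `write' closure, and `writer_bytes' would have no way to get at it
given just the writer.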

> It may be that any object whatsoever may be passed
> as the descriptor, inasmuch as the reader's state
> is essentially private to the procedures that are
> passed (read!, available, get-position, set-position!,
> end-position, close).  In that case, make-simple-reader
> has a purpose, but I wonder what purpose is served by
> its descriptor argument.

The problem is that this state is hidden in closures, which makes it
difficult to expose to auxiliary operations such as `writer-bytes'.

> Prerequisites:  The unspecified value should be specified
> as the value returned by the unspecified procedure.

Done; note that this is purely for the SRFI version.  It won't appear
in the R6RS document.

> Instead of saying "strings are represented as vectors
> of scalar values", which implies that the vector?
> predicate is true of strings, it should say something
> like "strings are analogous to vectors of scalar values".

Done.

> File options:  Instead of saying that file options are
> as in SRFI 79, it should say that file options are a
> subset of a certain set of symbols, as in the current
> draft of the primitive io srfi.

Done.

> Buffer modes:  In addition to none, line, and block,
> shouldn't there be an insouciant mode?

Sure.  Could we pick a different word, though?  I'm reasonably
proficient in English, but I had to look this one up.  (And, looking
at the entry in Roget's, it seems to have negative connotations.)  How
about `dont-care', `no-preference' or `never-mind'?

> The description of buffer-mode should refer to name as a symbol, not
> as an identifier.  (The buffer-mode syntax should recognize the name
> as a symbol, not as the name of a variable.  This matters when
> buffer-mode is used within the scope of a variable whose name looks
> like the symbol that names the mode.)

Done.

> Specification of eol-style:  These forms should
> evaluate to the symbols lf, crlf, and cr.

Done.

> Specification of read-bytes-some:  If this procedure
> is intended to hang when waiting to see whether more
> bytes are forthcoming from its argument, the spec
> should say so.  This applies to several subsequent
> specifications also.

I've tried to improve this.

> Specification of read-u8:  Please define octet
> somewhere.  

I've replaced "octet" by "byte" pervasively.

> The spec speaks of "the next end of file"; do you envision input
> ports that contain multiple ends of file?

The model is that you have a byte sequence with interleaved
end-of-files which goes on indefinitely.  For a finite data source, it
ends in an infinite sequence of end-of-files.  I've tried to describe
this better.

> How is "just past the end
> of file" different from "just before the end of file"?  

It differs in whether the next read-<something> will return this end
of file or whatever comes after it.  (Which may be another end of file
object, or not.  But the difference is observable if a byte comes
after it.)
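
A toy model of this, sketched in Python rather than Scheme, may help.
The class `ModelPort' and the `EOF' marker are hypothetical; `read_u8'
and `port_eof' only mirror `read-u8' and `port-eof?' and are not the
real procedures.

```python
EOF = object()  # stand-in for the end-of-file object


class ModelPort:
    """Toy model: a sequence of bytes and EOF markers, followed by
    endlessly many EOFs once the listed items run out."""

    def __init__(self, items):
        self.items = list(items)
        self.pos = 0

    def peek(self):
        return self.items[self.pos] if self.pos < len(self.items) else EOF

    def read_u8(self):
        item = self.peek()
        if self.pos < len(self.items):
            self.pos += 1   # consume a byte or one explicit EOF
        return item

    def port_eof(self):
        return self.peek() is EOF


port = ModelPort([0x61, EOF, 0x62])
first = port.read_u8()        # the byte 0x61; now just before the EOF
before = port.port_eof()      # True: the next read returns the EOF
port.read_u8()                # consumes the EOF; now just past it
after = port.port_eof()       # False: a byte comes next
second = port.read_u8()       # the byte 0x62
tail = [port.read_u8() is EOF for _ in range(3)]  # only EOFs remain
```

"Just before" and "just past" are the two positions on either side of
the explicit EOF; they differ in what the next read returns.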

> These questions apply to several subsequent specifications as well.
> By the way, what if UTF-8 is inconsistent with the transcoding of
> the input port?

This last sentence I don't understand.  Could you explain?

> Specification of read-string:  The number of bytes
> read appears to be ambiguous, since 0 bytes can
> always be interpreted as a UTF-8 string and many
> bytes that could follow a UTF-8 string might be
> interpreted as an extension of that string.

`Read-string' is really a dumb idea (someone on the SRFI list spotted
it, but I had forgotten about it); it was there for symmetry with
`read-bytes'.  I've elided it.

>
> Specification of read-char:  This also seems
> ambiguous in the sense that the character #\a
> might be followed by modifiers that could be
> composed with #\a to form a new character.  I
> presume the intent is that no such compositions
> be formed.

No.  The prefix of the byte sequence forms an encoding of a scalar
value, and it's unambiguous when that sequence ends.
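
As an illustration only (Python, with hypothetical helper names, not
the proposal's `read-char'), the lead byte of a UTF-8 sequence fixes
its length, so a following combining mark is simply the next scalar
value:

```python
def utf8_length(lead):
    """Length in bytes of the UTF-8 sequence that starts with `lead`."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError("not a valid UTF-8 lead byte: %#x" % lead)


def read_char(data, pos=0):
    """Decode one scalar value at `pos`; return (char, next position)."""
    n = utf8_length(data[pos])
    return data[pos:pos + n].decode("utf-8"), pos + n


data = "a\u0301b".encode("utf-8")   # 'a', COMBINING ACUTE ACCENT, 'b'
ch1, p1 = read_char(data)           # reads 'a' alone, no composition
ch2, p2 = read_char(data, p1)       # the combining accent by itself
ch3, p3 = read_char(data, p2)       # 'b'
```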

> A similar remark applies to the next two procedures.
>
> Specification of port-eof?:  What if the port is
> currently pointing *past* an end of file (whatever
> that means)?

If there's a byte there, it will return #f.  If there's another end of
file, it will return #t.

> Specification of input-port-position:  The term
> "transcoded port" has not been defined prior to
> its mention in this spec.

Done.

> Ditto for "truncated stream" and "translated stream".

Leftovers from a previous version; elided.

> Specification of set-input-port-position!:  Ditto
> the above, plus "terminated stream", which I assume
> is something like a closed port.

Yes, but not quite. Elided.

> transcode-input-port!:  I don't like the side
> effect on the port.  I assume the intention is
> to prevent non-UTF-8 data from being written to
> a UTF-8 port.  

No.  The intention is to support reading data streams with unknown
encodings, where the first few bytes denote the encoding.  This is
fairly common with Unicode, with a BOM at the beginning.  (This is
where the concept of a purely "character port" falls down, BTW.)
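
Here is a rough sketch of that use case in Python, using the BOM
constants from the standard `codecs' module.  The helpers `sniff' and
`read_text' are hypothetical, and defaulting to UTF-8 when no BOM is
present is an assumption of the sketch, not something the proposal
prescribes.

```python
import codecs

# UTF-32 marks must be tested before UTF-16: BOM_UTF32_LE begins with
# the same two bytes as BOM_UTF16_LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff(data, default="utf-8"):
    """Return (encoding, offset of the first byte after the BOM)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return default, 0


def read_text(data):
    encoding, start = sniff(data)          # look at the raw bytes first...
    return data[start:].decode(encoding)   # ...then transcode from there


utf16 = codecs.BOM_UTF16_BE + "Gr\u00fc\u00dfe".encode("utf-16-be")
plain = "Gr\u00fc\u00dfe".encode("utf-8")
```

The first bytes have to be read as bytes before any encoding is known,
which is the step a purely character-based port cannot express.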

> Specification of open-bytes-input-port:  The term
> "byte stream" has not been defined.  Ditto for
> open-string-input-port.

Leftover; elided.

> Specification of write-bytes: What if the bytes to be written are
> inconsistent with the transcoder associated with the output port?
> The same question applies to write-u8, write-string-n, write-char,
> et cetera.

I assume you mean the situation where a non-UTF-8 byte sequence is
written.  I've put a paragraph on this in the "Transcoders" section.
(This was specified in the original SRFI, but the relevant section
drifted to the Streams SRFI, I think.)

> set-output-port-buffer-mode!:  Might there be some
> inefficiency associated with requiring every output
> port to support this operation?

I don't think so.

> transcode-output-port!:  See my remarks regarding
> transcode-input-port!.

Actually, the restriction on `transcode-input-port!' isn't necessary
for `transcode-output-port!'.  I've removed it.

> call-with-string-output-port:  Why does this create
> a "bytes writer" instead of a character writer?  If
> it's a bytes writer, programs can write sequences
> of bytes that have no UTF-8 decoding, and the spec
> doesn't say what's supposed to happen in that case.

It does now.

> Specification of open-file-input+output-ports:  The
> period at the end of the first sentence should be
> outside both parentheses.  (Once again, "stream ports"
> is an undefined term.)

Done.

> Design rationale, Encoding:  The rationale claims to
> avoid the problems that result from "associating an
> encoding with a port", "by specifying that textual
> I/O always uses UTF-8".  I don't follow this at all.
> The proposal includes "predefined codecs for the ISO
> 8859-1, UTF-16LE, UTF-16BE, UTF32-LE, and UTF-32BE
> encodings"

The codecs translate between UTF-8 and the other encodings.

> and provides a side-effecting procedure that associates them with a
> port; furthermore that side effect is allowed only once, which seems
> really ad hoc given that some data may already have been read from
> or written to the port before that side effect is performed.

What you describe is exactly the reason why it's only supported
once: If the stream is un-transcoded, the buffer position easily
corresponds to a position in the input stream, and it's trivial to do
the transcoding *from that point*.  If it is transcoded, this mapping
isn't easily available.
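
A small Python illustration of the mapping problem, assuming for the
sake of the example a UTF-16LE source whose transcoded buffer holds
UTF-8 (the variable names are made up):

```python
text = "a\u00e9\u20acb"              # 'a', e-acute, euro sign, 'b'
source = text.encode("utf-16-le")     # bytes as they arrive from the source
buffer = text.encode("utf-8")         # what a transcoded buffer would hold

# Offsets of each character boundary in the buffer and in the source.
buffer_offsets = [len(text[:i].encode("utf-8"))
                  for i in range(len(text) + 1)]
source_offsets = [len(text[:i].encode("utf-16-le"))
                  for i in range(len(text) + 1)]
```

Here the buffer boundaries fall at 0, 1, 3, 6, 7 and the source
boundaries at 0, 2, 4, 6, 8.  Recovering one from the other means
re-scanning the data, whereas in an un-transcoded buffer the two
positions are the same number.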

> Design rationale, display:  According to the most
> recent status report, formatted output is not under
> consideration for R6RS, so something like display
> should remain.

Is the R5RS compatibility library not enough?

-- 
Cheers =8-} Mike
Friede, Völkerverständigung und überhaupt blabla